BufferAlloc: don't take two simultaneous locks

Started by Yura Sokolov over 4 years ago · 71 messages
#1 Yura Sokolov
y.sokolov@postgrespro.ru
5 attachment(s)

Good day.

I found an optimization opportunity in the Buffer Manager code, in the
BufferAlloc function:
- When a valid buffer is evicted, BufferAlloc acquires two partition
lwlocks: one for the partition the evicted block is in, and one for the
partition where the new block will be placed.

This doesn't matter when there is only a small number of concurrent
replacements. But if many concurrent backends are replacing buffers, a
complex dependency network quickly arises.

It can easily be seen with select-only pgbench at scale 100 and 128MB
of shared buffers: scale 100 produces 1.5GB of tables, which certainly
doesn't fit in shared buffers. In this setup performance starts to
degrade at ~100 connections. Even with 1GB of shared buffers it slowly
degrades after 150 connections.

But strictly speaking, there is no need to hold both locks
simultaneously. The buffer is pinned, so other processes cannot select
it for eviction. Once its tag is cleared and the buffer is removed from
the old partition, other processes will not find it. Therefore it is
safe to release the old partition lock before acquiring the new
partition lock.

If another process concurrently inserts the same new buffer, the old
buffer is placed on the buffer manager's freelist.
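
To make the change concrete, below is a minimal, self-contained toy
model (plain C with pthreads; the partition count and function names
are invented, this is not the PostgreSQL code) contrasting the two
protocols:

/*
 * Toy model of the locking protocols (plain C + pthreads, NOT
 * PostgreSQL code).  The patched flow never holds two partition
 * locks at once, so no lock-ordering rule is needed and backends
 * cannot form long chains waiting on each other's lock pairs.
 * Compile with: cc -pthread toy.c
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_PARTITIONS 128

static pthread_mutex_t partition_lock[NUM_PARTITIONS];

/* master: take both locks, lower-numbered first, to avoid deadlock */
static void
relocate_two_locks(int oldp, int newp)
{
	if (oldp < newp)
	{
		pthread_mutex_lock(&partition_lock[oldp]);
		pthread_mutex_lock(&partition_lock[newp]);
	}
	else if (oldp > newp)
	{
		pthread_mutex_lock(&partition_lock[newp]);
		pthread_mutex_lock(&partition_lock[oldp]);
	}
	else
		pthread_mutex_lock(&partition_lock[oldp]);

	/* ... delete old mapping, insert new mapping ... */

	pthread_mutex_unlock(&partition_lock[oldp]);
	if (oldp != newp)
		pthread_mutex_unlock(&partition_lock[newp]);
}

/* patched: at most one partition lock at a time; the price is a
 * window in which another backend may insert the same new tag, so
 * the second critical section must start with a lookup (the
 * collision path described above) */
static void
relocate_one_lock(int oldp, int newp)
{
	pthread_mutex_lock(&partition_lock[oldp]);
	/* ... delete old mapping; the buffer itself stays pinned ... */
	pthread_mutex_unlock(&partition_lock[oldp]);

	pthread_mutex_lock(&partition_lock[newp]);
	/* ... look up the new tag first; on collision free the old
	 * entry, otherwise insert, reusing the old entry ... */
	pthread_mutex_unlock(&partition_lock[newp]);
}

int
main(void)
{
	for (int i = 0; i < NUM_PARTITIONS; i++)
		pthread_mutex_init(&partition_lock[i], NULL);
	relocate_two_locks(3, 7);
	relocate_one_lock(3, 7);
	puts("ok");
	return 0;
}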

Additional optimization: when the old buffer's entry is reused, there
is no need to put its BufferLookupEnt onto dynahash's freelist. That
reduces lock contention a bit more. To accomplish this,
FreeListData.nentries is changed to pg_atomic_u32/pg_atomic_u64 and
atomic increment/decrement is used.
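
Roughly, the counter handling amounts to the following standalone model
(C11 atomics and invented helper names stand in for pg_atomic_* and the
patch's functions; this is not the patch's code):

/* Standalone model of the nentries change.  Partitioned tables use an
 * atomic read-modify-write instead of taking the freelist spinlock
 * just to bump the counter. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	atomic_uint_fast64_t nentries;	/* stands in for pg_atomic_u64 */
} FreeList;

static void
freelist_inc(FreeList *fl, bool partitioned)
{
	if (partitioned)
		atomic_fetch_add(&fl->nentries, 1);	/* no spinlock needed */
	else
		atomic_store(&fl->nentries,
					 atomic_load(&fl->nentries) + 1);	/* single user */
}

static void
freelist_dec(FreeList *fl, bool partitioned)
{
	uint64_t	old;

	if (partitioned)
		old = atomic_fetch_sub(&fl->nentries, 1);
	else
	{
		old = atomic_load(&fl->nentries);
		atomic_store(&fl->nentries, old - 1);
	}
	/* unsigned translation of Assert(nentries >= 0): an underflow
	 * would wrap and land above any plausible entry count */
	assert(old > 0);
}

int
main(void)
{
	FreeList	fl;

	atomic_init(&fl.nentries, 0);
	freelist_inc(&fl, true);
	freelist_dec(&fl, true);
	return 0;
}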

Remark: there was a bug in `hash_update_hash_key`: nentries were not
kept in sync if the freelist partitions differed. This bug was never
triggered because the single existing caller of `hash_update_hash_key`
doesn't move entries between partitions.

Here are some test results.

- pgbench at scale 100 was run with --select-only, since we want to
test the buffer manager alone (example invocations follow the list of
configurations below). Scale 100 produces a 1.5GB table.
- two shared_buffers values were tested: 128MB and 1GB
- the second-best result of five runs was taken

Tests were run on three system configurations:
- a notebook with an i7-1165G7 (limited to 2.8GHz to avoid overheating)
- a Xeon X5675 6-core, 2-socket NUMA system (12 cores/24 threads)
- the same Xeon X5675 restricted to a single socket
(with numactl -m 0 -N 0)
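
For reference, one run of this kind looks roughly as follows (the
database name and the -j/-T values here are illustrative assumptions;
only scale 100, --select-only mode and the connection counts are fixed
above):

pgbench -i -s 100 bench
pgbench --select-only -c $CONNS -j $CONNS -T 30 bench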

Results for i7-1165G7 (tps):

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 29667 | 29079 | 29425 | 29411
2 | 55577 | 55553 | 57974 | 57223
3 | 87393 | 87924 | 87246 | 89210
5 | 136222 | 136879 | 133775 | 133949
7 | 179865 | 176734 | 178297 | 175559
17 | 215953 | 214708 | 222908 | 223651
27 | 211162 | 213014 | 220506 | 219752
53 | 211620 | 218702 | 220906 | 225218
83 | 213488 | 221799 | 219075 | 228096
107 | 212018 | 222110 | 222502 | 227825
139 | 207068 | 220812 | 218191 | 226712
163 | 203716 | 220793 | 213498 | 226493
191 | 199248 | 217486 | 210994 | 221026
211 | 195887 | 217356 | 209601 | 219397
239 | 193133 | 215695 | 209023 | 218773
271 | 190686 | 213668 | 207181 | 219137
307 | 188066 | 214120 | 205392 | 218782
353 | 185449 | 213570 | 202120 | 217786
397 | 182173 | 212168 | 201285 | 216489

Results for 1-socket X5675 (tps):

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 16864 | 16584 | 17419 | 17630
2 | 32764 | 32735 | 34593 | 34000
3 | 47258 | 46022 | 49570 | 47432
5 | 64487 | 64929 | 68369 | 68885
7 | 81932 | 82034 | 87543 | 87538
17 | 114502 | 114218 | 127347 | 127448
27 | 116030 | 115758 | 130003 | 128890
53 | 116814 | 117197 | 131142 | 131080
83 | 114438 | 116704 | 130198 | 130985
107 | 113255 | 116910 | 129932 | 131468
139 | 111577 | 116929 | 129012 | 131782
163 | 110477 | 116818 | 128628 | 131697
191 | 109237 | 116672 | 127833 | 131586
211 | 108248 | 116396 | 127474 | 131650
239 | 107443 | 116237 | 126731 | 131760
271 | 106434 | 115813 | 126009 | 131526
307 | 105077 | 115542 | 125279 | 131421
353 | 104516 | 115277 | 124491 | 131276
397 | 103016 | 114842 | 123624 | 131019

Results for 2-socket X5675 (tps):

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 16323 | 16280 | 16959 | 17598
2 | 30510 | 31431 | 33763 | 31690
3 | 45051 | 45834 | 48896 | 47991
5 | 71800 | 73208 | 78077 | 74714
7 | 89792 | 89980 | 95986 | 96662
17 | 178319 | 177979 | 195566 | 196143
27 | 210475 | 205209 | 226966 | 235249
53 | 222857 | 220256 | 252673 | 251041
83 | 219652 | 219938 | 250309 | 250464
107 | 218468 | 219849 | 251312 | 251425
139 | 210486 | 217003 | 250029 | 250695
163 | 204068 | 218424 | 248234 | 252940
191 | 200014 | 218224 | 246622 | 253331
211 | 197608 | 218033 | 245331 | 253055
239 | 195036 | 218398 | 243306 | 253394
271 | 192780 | 217747 | 241406 | 253148
307 | 189490 | 217607 | 239246 | 253373
353 | 186104 | 216697 | 236952 | 253034
397 | 183507 | 216324 | 234764 | 252872

As can be seen, the patched version degrades much more slowly than
master. (Or doesn't degrade at all with 1GB of shared buffers on the
older processor.)

PS.

There is room for further improvement:
- the buffer manager's freelist could be partitioned
- dynahash's freelists could be sized/aligned to CPU cache lines
- in fact, there is no need for dynahash at all. It would be better to
make a custom hash table using the BufferDescs themselves as entries:
BufferDesc has spare space for a next-link and a hash value (rough
sketch below).
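
A minimal, self-contained sketch of that last idea (invented field and
function names; a real implementation would need the actual BufferTag,
locking, and buffer-state handling):

/* Toy chained hash table whose entries are the buffer descriptors
 * themselves: the descriptor array stores the chain links, so no
 * separate entry pool or dynahash freelist is needed. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBUF	8
#define NBUCKET	16

typedef struct
{
	uint32_t	hashvalue;	/* cached hash of the buffer tag */
	int			next;		/* buf_id of next entry in chain, or -1 */
	uint32_t	tag;		/* stand-in for the real BufferTag */
} ToyBufDesc;

static ToyBufDesc descs[NBUF];
static int	buckets[NBUCKET];

static void
toy_hash_insert(int buf_id, uint32_t tag, uint32_t hash)
{
	int			b = hash % NBUCKET;

	descs[buf_id].tag = tag;
	descs[buf_id].hashvalue = hash;
	descs[buf_id].next = buckets[b];
	buckets[b] = buf_id;
}

static int
toy_hash_lookup(uint32_t tag, uint32_t hash)
{
	for (int id = buckets[hash % NBUCKET]; id >= 0; id = descs[id].next)
		if (descs[id].hashvalue == hash && descs[id].tag == tag)
			return id;
	return -1;
}

int
main(void)
{
	memset(buckets, -1, sizeof(buckets));	/* -1 == empty bucket */
	toy_hash_insert(3, 42, 42u * 2654435761u);
	printf("found buf_id %d\n", toy_hash_lookup(42, 42u * 2654435761u));
	return 0;
}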

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v0-0001-bufmgr-do-not-acquire-two-partition-lo.patch (text/x-patch)
From a1606eaa124fc497763ed5e28e22cbc8f6443b33 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Wed, 22 Sep 2021 13:10:37 +0300
Subject: [PATCH v0] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned so
other processes cannot select it for eviction. If the tag is cleared and the
buffer removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

This change requires manually returning the BufferDesc to the free list.

Also, insertion into and deletion from dynahash are optimized by avoiding
unnecessary free list manipulations in the common case (when the buffer's
entry is reused).

Also, a small and never-triggered bug in hash_update_hash_key is fixed.
---
 src/backend/storage/buffer/buf_table.c |  54 +++--
 src/backend/storage/buffer/bufmgr.c    | 183 ++++++++--------
 src/backend/utils/hash/dynahash.c      | 289 +++++++++++++++++++++++--
 src/include/storage/buf_internals.h    |   6 +-
 src/include/utils/hsearch.h            |  17 ++
 5 files changed, 404 insertions(+), 145 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..05e1dc9dd29 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -107,36 +107,29 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 
 /*
  * BufTableInsert
- *		Insert a hashtable entry for given tag and buffer ID,
- *		unless an entry already exists for that tag
- *
- * Returns -1 on successful insertion.  If a conflicting entry exists
- * already, returns the buffer ID in that entry.
+ *		Insert a hashtable entry for given tag and buffer ID.
+ *		Caller should be sure there is no conflicting entry.
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ * and call BufTableLookup to check for conflicting entry.
+ *
+ * If oldelem is passed it is reused.
  */
-int
-BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
+void
+BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id, void *oldelem)
 {
 	BufferLookupEnt *result;
-	bool		found;
 
 	Assert(buf_id >= 0);		/* -1 is reserved for not-in-table */
 	Assert(tagPtr->blockNum != P_NEW);	/* invalid tag */
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
-									(void *) tagPtr,
-									hashcode,
-									HASH_ENTER,
-									&found);
-
-	if (found)					/* found something already in the table */
-		return result->id;
+		hash_insert_with_hash_nocheck(SharedBufHash,
+									  (void *) tagPtr,
+									  hashcode,
+									  oldelem);
 
 	result->id = buf_id;
-
-	return -1;
 }
 
 /*
@@ -144,19 +137,32 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  *		Delete the hashtable entry for given tag (which must exist)
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ *
+ * Returns pointer to internal hashtable entry that should be passed either
+ * to BufTableInsert or BufTableFreeDeleted.
  */
-void
+void *
 BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 {
 	BufferLookupEnt *result;
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
-									(void *) tagPtr,
-									hashcode,
-									HASH_REMOVE,
-									NULL);
+		hash_delete_skip_freelist(SharedBufHash,
+								  (void *) tagPtr,
+								  hashcode);
 
 	if (!result)				/* shouldn't happen */
 		elog(ERROR, "shared buffer hash table corrupted");
+
+	return result;
+}
+
+/*
+ * BufTableFreeDeleted
+ *		Returns deleted hashtable entry to freelist.
+ */
+void
+BufTableFreeDeleted(void *oldelem, uint32 hashcode)
+{
+	hash_return_to_freelist(SharedBufHash, oldelem, hashcode);
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b0..9c23e54f7c5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1114,6 +1114,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *buf;
 	bool		valid;
 	uint32		buf_state;
+	void	   *oldElem = NULL;
 
 	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1288,93 +1289,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1391,31 +1315,102 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
+	 * IO_IN_PROGRESS (flags that were not cleared by the previous code).
+	 */
+	Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	if (oldFlags & BM_TAG_VALID)
+		oldElem = BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. Note that we have not yet
+	 * removed the hashtable entry for the old tag.
+	 */
+	buf_id = BufTableLookup(&newTag, newHash);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.  First, give up the buffer we were planning to use
+		 * and put it to free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+		if (oldElem != NULL)
+			BufTableFreeDeleted(oldElem, oldHash);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Okay, it's finally safe to rename the buffer.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1423,13 +1418,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
+	BufTableInsert(&newTag, newHash, buf->buf_id, oldElem);
 	LWLockRelease(newPartitionLock);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c79..ce5bba8e975 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -99,6 +99,7 @@
 #include "access/xact.h"
 #include "common/hashfn.h"
 #include "port/pg_bitutils.h"
+#include "port/atomics.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/dynahash.h"
@@ -133,6 +134,18 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+#if SIZEOF_LONG == 8
+typedef pg_atomic_uint64 Count;
+#define count_atomic_inc(x)	pg_atomic_fetch_add_u64((x), 1)
+#define count_atomic_dec(x)	pg_atomic_fetch_sub_u64((x), 1)
+#define MAX_NENTRIES	((uint64)PG_INT64_MAX)
+#else
+typedef pg_atomic_uint32 Count;
+#define count_atomic_inc(x)	pg_atomic_fetch_add_u32((x), 1)
+#define count_atomic_dec(x)	pg_atomic_fetch_sub_u32((x), 1)
+#define MAX_NENTRIES	((uint32)PG_INT32_MAX)
+#endif
+
 /*
  * Per-freelist data.
  *
@@ -153,7 +166,7 @@ typedef HASHBUCKET *HASHSEGMENT;
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	Count		nentries;		/* number of entries in associated buckets */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -306,6 +319,54 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+/*
+ * Free list routines
+ */
+static inline void
+free_list_link_entry(HASHHDR *hctl, HASHBUCKET currBucket, int freelist_idx)
+{
+	FreeListData *list = &hctl->freeList[freelist_idx];
+
+	if (IS_PARTITIONED(hctl))
+	{
+		SpinLockAcquire(&list->mutex);
+		currBucket->link = list->freeList;
+		list->freeList = currBucket;
+		SpinLockRelease(&list->mutex);
+	}
+	else
+	{
+		currBucket->link = list->freeList;
+		list->freeList = currBucket;
+	}
+}
+
+static inline void
+free_list_increment_nentries(HASHHDR *hctl, int freelist_idx)
+{
+	FreeListData *list = &hctl->freeList[freelist_idx];
+
+	/* Check for overflow */
+	Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
+
+	if (IS_PARTITIONED(hctl))
+		count_atomic_inc(&list->nentries);
+	else
+		list->nentries.value++;
+}
+
+static inline void
+free_list_decrement_nentries(HASHHDR *hctl, int freelist_idx)
+{
+	FreeListData *list = &hctl->freeList[freelist_idx];
+
+	if (IS_PARTITIONED(hctl))
+		count_atomic_dec(&list->nentries);
+	else
+		list->nentries.value--;
+	/* Check for overflow */
+	Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
+}
 
 /************************** CREATE ROUTINES **********************/
 
@@ -1000,7 +1061,7 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		if (hctl->freeList[0].nentries.value > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1057,23 +1118,14 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
-				if (IS_PARTITIONED(hctl))
-					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
-
 				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
+				free_list_decrement_nentries(hctl, freelist_idx);
 
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
-				currBucket->link = hctl->freeList[freelist_idx].freeList;
-				hctl->freeList[freelist_idx].freeList = currBucket;
-
-				if (IS_PARTITIONED(hctl))
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+				free_list_link_entry(hctl, currBucket, freelist_idx);
 
 				/*
 				 * better hope the caller is synchronizing access to this
@@ -1115,6 +1167,7 @@ hash_search_with_hash_value(HTAB *hashp,
 							(errcode(ERRCODE_OUT_OF_MEMORY),
 							 errmsg("out of memory")));
 			}
+			free_list_increment_nentries(hctl, freelist_idx);
 
 			/* link into hashbucket chain */
 			*prevBucketPtr = currBucket;
@@ -1165,6 +1218,7 @@ hash_update_hash_key(HTAB *hashp,
 {
 	HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
 	HASHHDR    *hctl = hashp->hctl;
+	uint32		oldhashvalue;
 	uint32		newhashvalue;
 	Size		keysize;
 	uint32		bucket;
@@ -1218,6 +1272,7 @@ hash_update_hash_key(HTAB *hashp,
 			 hashp->tabname);
 
 	oldPrevPtr = prevBucketPtr;
+	oldhashvalue = existingElement->hashvalue;
 
 	/*
 	 * Now perform the equivalent of a HASH_ENTER operation to locate the hash
@@ -1271,12 +1326,21 @@ hash_update_hash_key(HTAB *hashp,
 	 */
 	if (bucket != newbucket)
 	{
+		int			old_freelist_idx = FREELIST_IDX(hctl, oldhashvalue);
+		int			new_freelist_idx = FREELIST_IDX(hctl, newhashvalue);
+
 		/* OK to remove record from old hash bucket's chain. */
 		*oldPrevPtr = currBucket->link;
 
 		/* link into new hashbucket chain */
 		*prevBucketPtr = currBucket;
 		currBucket->link = NULL;
+
+		if (old_freelist_idx != new_freelist_idx)
+		{
+			free_list_decrement_nentries(hctl, old_freelist_idx);
+			free_list_increment_nentries(hctl, new_freelist_idx);
+		}
 	}
 
 	/* copy new key into record */
@@ -1288,6 +1352,193 @@ hash_update_hash_key(HTAB *hashp,
 	return true;
 }
 
+/*
+ * hash_insert_with_hash_nocheck - inserts new entry into bucket without
+ * checking for duplicates.
+ *
+ * Caller should be sure there is no conflicting entry.
+ *
+ * Caller may pass a pointer to an old entry acquired with
+ * hash_delete_skip_freelist.  In this case the entry is reused and
+ * returned as the new one.
+ */
+void *
+hash_insert_with_hash_nocheck(HTAB *hashp,
+							  const void *keyPtr,
+							  uint32 hashvalue,
+							  void *oldentry)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
+	uint32		bucket;
+	long		segment_num;
+	long		segment_ndx;
+	HASHSEGMENT segp;
+	HASHBUCKET	currBucket;
+	HASHBUCKET *prevBucketPtr;
+
+#if HASH_STATISTICS
+	hash_accesses++;
+	hctl->accesses++;
+#endif
+
+	/* disallow updates if frozen */
+	if (hashp->frozen)
+		elog(ERROR, "cannot update in frozen hashtable \"%s\"",
+			 hashp->tabname);
+
+	if (!IS_PARTITIONED(hctl) &&
+		hctl->freeList[0].nentries.value > (long) hctl->max_bucket &&
+		!has_seq_scans(hashp))
+		(void) expand_table(hashp);
+
+	/*
+	 * Compute the bucket for the given hash value, so the new entry can be
+	 * linked into that bucket's chain.
+	 */
+	bucket = calc_bucket(hctl, hashvalue);
+
+	segment_num = bucket >> hashp->sshift;
+	segment_ndx = MOD(bucket, hashp->ssize);
+
+	segp = hashp->dir[segment_num];
+
+	if (segp == NULL)
+		hash_corrupted(hashp);
+
+	prevBucketPtr = &segp[segment_ndx];
+
+	if (oldentry != NULL)
+		currBucket = ELEMENT_FROM_KEY(oldentry);
+	else
+		currBucket = get_hash_entry(hashp, freelist_idx);
+	free_list_increment_nentries(hctl, freelist_idx);
+
+	if (currBucket == NULL)
+	{
+		/* report a generic message */
+		if (hashp->isshared)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+	}
+
+	/* copy key into record */
+	currBucket->hashvalue = hashvalue;
+	currBucket->link = *prevBucketPtr;
+	hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, hashp->keysize);
+
+	*prevBucketPtr = currBucket;
+
+	return (void *) ELEMENTKEY(currBucket);
+}
+
+/*
+ * hash_delete_skip_freelist - find and delete an entry, but don't put it
+ * on the free list.
+ *
+ * Used in the Buffer Manager to reuse the entry of an evicted buffer.
+ *
+ * Returned entry should be either reused with hash_insert_with_hash_nocheck
+ * or returned to free list with hash_return_to_freelist.
+ */
+void *
+hash_delete_skip_freelist(HTAB *hashp,
+						  const void *keyPtr,
+						  uint32 hashvalue)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
+	Size		keysize;
+	uint32		bucket;
+	long		segment_num;
+	long		segment_ndx;
+	HASHSEGMENT segp;
+	HASHBUCKET	currBucket;
+	HASHBUCKET *prevBucketPtr;
+	HashCompareFunc match;
+
+#if HASH_STATISTICS
+	hash_accesses++;
+	hctl->accesses++;
+#endif
+
+	/*
+	 * Do the initial lookup
+	 */
+	bucket = calc_bucket(hctl, hashvalue);
+
+	segment_num = bucket >> hashp->sshift;
+	segment_ndx = MOD(bucket, hashp->ssize);
+
+	segp = hashp->dir[segment_num];
+
+	if (segp == NULL)
+		hash_corrupted(hashp);
+
+	prevBucketPtr = &segp[segment_ndx];
+	currBucket = *prevBucketPtr;
+
+	/*
+	 * Follow collision chain looking for matching key
+	 */
+	match = hashp->match;		/* save one fetch in inner loop */
+	keysize = hashp->keysize;	/* ditto */
+
+	while (currBucket != NULL)
+	{
+		if (currBucket->hashvalue == hashvalue &&
+			match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+			break;
+		prevBucketPtr = &(currBucket->link);
+		currBucket = *prevBucketPtr;
+#if HASH_STATISTICS
+		hash_collisions++;
+		hctl->collisions++;
+#endif
+	}
+
+	if (currBucket == NULL)
+		return NULL;
+
+	/* delete the record from the appropriate nentries counter. */
+	free_list_decrement_nentries(hctl, freelist_idx);
+
+	/* remove record from hash bucket's chain. */
+	*prevBucketPtr = currBucket->link;
+
+	return (void *) ELEMENTKEY(currBucket);
+}
+
+/*
+ * hash_return_to_freelist - return entry deleted with
+ * hash_delete_skip_freelist to free list.
+ *
+ * Used in the Buffer Manager in case a new conflicting entry was inserted
+ * by a concurrent process.
+ */
+void
+hash_return_to_freelist(HTAB *hashp,
+						const void *entry,
+						uint32 hashvalue)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
+	HASHBUCKET	currBucket;
+
+	if (entry == NULL)
+		return;
+
+	currBucket = ELEMENT_FROM_KEY(entry);
+
+	/* add the record to the appropriate freelist. */
+	free_list_link_entry(hctl, currBucket, freelist_idx);
+}
+
 /*
  * Allocate a new hashtable entry if possible; return NULL if out of memory.
  * (Or, if the underlying space allocator throws error for out-of-memory,
@@ -1349,11 +1600,6 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1611,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist */
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1627,7 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = hashp->hctl->freeList[0].nentries.value;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1637,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum += hashp->hctl->freeList[i].nentries.value;
 	}
 
 	return sum;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a8..4a1d6b37821 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,10 @@ extern Size BufTableShmemSize(int size);
 extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
-extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id,
+						   void *oldelem);
+extern void *BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableFreeDeleted(void *oldelem, uint32 hashcode);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index d7af0239c8c..1d586ef1169 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -150,4 +150,21 @@ extern Size hash_get_shared_size(HASHCTL *info, int flags);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
+/*
+ * Buffer Manager optimization utilities.
+ * They are made to avoid taking two partition locks simultaneously and
+ * to skip interaction with dynahash's freelist.
+ * Use them carefully and only if they give a meaningful improvement.
+ */
+extern void *hash_insert_with_hash_nocheck(HTAB *hashp,
+										   const void *keyPtr,
+										   uint32 hashvalue,
+										   void *oldentry);
+extern void *hash_delete_skip_freelist(HTAB *hashp,
+									   const void *keyPtr,
+									   uint32 hashvalue);
+extern void hash_return_to_freelist(HTAB *hashp,
+									const void *entry,
+									uint32 hashvalue);
+
 #endif							/* HSEARCH_H */
-- 
2.33.0

1socket.gif (image/gif)
2socket.gif (image/gif)
notebook.gif (image/gif)
res.zip (application/zip)
#2 Zhihong Yu
zyu@yugabyte.com
In reply to: Yura Sokolov (#1)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:


Hi,
Improvement is impressive.

For BufTableFreeDeleted(): since it only has one caller, maybe that
caller can invoke hash_return_to_freelist() directly.

For free_list_decrement_nentries():

+ Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary? There is a similar assertion in
free_list_increment_nentries() which would maintain
hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

Cheers

#3 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhihong Yu (#2)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:

> Hi,
> Improvement is impressive.

Thank you!

> For BufTableFreeDeleted(): since it only has one caller, maybe that
> caller can invoke hash_return_to_freelist() directly.

It would be a dirty break of abstraction: everywhere else we talk to
BufTable, and here we would suddenly call hash_* directly ... eugh.

> For free_list_decrement_nentries():
>
> + Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
>
> Is the assertion necessary? There is a similar assertion in
> free_list_increment_nentries() which would maintain
> hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

The assertion in free_list_decrement_nentries is absolutely necessary:
it is a direct translation of Assert(nentries >= 0) from signed types
to unsigned. (Since there are no signed atomics in pg, I had to convert
the signed `long nentries` to an unsigned `pg_atomic_uXX nentries`.)

The assertion in free_list_increment_nentries is not necessary. But it
doesn't hurt either - it is just an Assert that doesn't compile into
production code.
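
A tiny standalone illustration of what the decrement-side check catches
(not the patch's code):

#include <assert.h>
#include <stdint.h>

int
main(void)
{
	uint64_t	nentries = 0;

	/* the bug the Assert must catch: with a signed counter this
	 * would make nentries == -1 */
	nentries--;

	/* unsigned counterpart of Assert(nentries >= 0): the wrapped
	 * value UINT64_MAX is far above INT64_MAX, so this fires */
	assert(nentries < (uint64_t) INT64_MAX);
	return 0;
}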

regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

#4 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#1)
1 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Sat, 02/10/2021 at 01:25 +0300, Yura Sokolov wrote:


Here is a fixed version:
- in the first version, InvalidateBuffer's BufTableDelete was not paired
with BufTableFreeDeleted.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v1-0001-bufmgr-do-not-acquire-two-partition-lo.patchtext/x-patch; charset=UTF-8; name=v1-0001-bufmgr-do-not-acquire-two-partition-lo.patchDownload
From c3388704432853d9c4cdf6e77b44360b572c6878 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Wed, 22 Sep 2021 13:10:37 +0300
Subject: [PATCH v1] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

This change requires manually returning the BufferDesc to the free list.

Insertion into and deletion from dynahash are also optimized by avoiding
unnecessary free list manipulation in the common case (when the buffer is
reused).

A small, never-triggered bug in hash_update_hash_key is also fixed.
---
 src/backend/storage/buffer/buf_table.c |  54 +++--
 src/backend/storage/buffer/bufmgr.c    | 190 ++++++++--------
 src/backend/utils/hash/dynahash.c      | 289 +++++++++++++++++++++++--
 src/include/storage/buf_internals.h    |   6 +-
 src/include/utils/hsearch.h            |  17 ++
 5 files changed, 410 insertions(+), 146 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..05e1dc9dd29 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -107,36 +107,29 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 
 /*
  * BufTableInsert
- *		Insert a hashtable entry for given tag and buffer ID,
- *		unless an entry already exists for that tag
- *
- * Returns -1 on successful insertion.  If a conflicting entry exists
- * already, returns the buffer ID in that entry.
+ *		Insert a hashtable entry for given tag and buffer ID.
+ *		Caller should be sure there is no conflicting entry.
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ * and call BufTableLookup to check for conflicting entry.
+ *
+ * If oldelem is passed it is reused.
  */
-int
-BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
+void
+BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id, void *oldelem)
 {
 	BufferLookupEnt *result;
-	bool		found;
 
 	Assert(buf_id >= 0);		/* -1 is reserved for not-in-table */
 	Assert(tagPtr->blockNum != P_NEW);	/* invalid tag */
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
-									(void *) tagPtr,
-									hashcode,
-									HASH_ENTER,
-									&found);
-
-	if (found)					/* found something already in the table */
-		return result->id;
+		hash_insert_with_hash_nocheck(SharedBufHash,
+									  (void *) tagPtr,
+									  hashcode,
+									  oldelem);
 
 	result->id = buf_id;
-
-	return -1;
 }
 
 /*
@@ -144,19 +137,32 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  *		Delete the hashtable entry for given tag (which must exist)
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ *
+ * Returns pointer to internal hashtable entry that should be passed either
+ * to BufTableInsert or BufTableFreeDeleted.
  */
-void
+void *
 BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 {
 	BufferLookupEnt *result;
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
-									(void *) tagPtr,
-									hashcode,
-									HASH_REMOVE,
-									NULL);
+		hash_delete_skip_freelist(SharedBufHash,
+								  (void *) tagPtr,
+								  hashcode);
 
 	if (!result)				/* shouldn't happen */
 		elog(ERROR, "shared buffer hash table corrupted");
+
+	return result;
+}
+
+/*
+ * BufTableFreeDeleted
+ *		Returns deleted hashtable entry to freelist.
+ */
+void
+BufTableFreeDeleted(void *oldelem, uint32 hashcode)
+{
+	hash_return_to_freelist(SharedBufHash, oldelem, hashcode);
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b0..6053a870e61 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1114,6 +1114,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *buf;
 	bool		valid;
 	uint32		buf_state;
+	void	   *oldElem = NULL;
 
 	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1288,93 +1289,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1391,31 +1315,102 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
+	 * IO_IN_PROGRESS (flags that were not cleared in the previous code).
+	 */
+	Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	if (oldFlags & BM_TAG_VALID)
+		oldElem = BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. Note that we have not yet
+	 * removed the hashtable entry for the old tag.
+	 */
+	buf_id = BufTableLookup(&newTag, newHash);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.  First, give up the buffer we were planning to use
+		 * and put it on the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+		if (oldElem != NULL)
+			BufTableFreeDeleted(oldElem, oldHash);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Okay, it's finally safe to rename the buffer.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1423,13 +1418,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
+	BufTableInsert(&newTag, newHash, buf->buf_id, oldElem);
 	LWLockRelease(newPartitionLock);
 
 	/*
@@ -1539,7 +1528,12 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+	{
+		void   *oldElem;
+
+		oldElem = BufTableDelete(&oldTag, oldHash);
+		BufTableFreeDeleted(oldElem, oldHash);
+	}
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c79..ce5bba8e975 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -99,6 +99,7 @@
 #include "access/xact.h"
 #include "common/hashfn.h"
 #include "port/pg_bitutils.h"
+#include "port/atomics.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/dynahash.h"
@@ -133,6 +134,18 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+#if SIZEOF_LONG == 8
+typedef pg_atomic_uint64 Count;
+#define count_atomic_inc(x)	pg_atomic_fetch_add_u64((x), 1)
+#define count_atomic_dec(x)	pg_atomic_fetch_sub_u64((x), 1)
+#define MAX_NENTRIES	((uint64)PG_INT64_MAX)
+#else
+typedef pg_atomic_uint32 Count;
+#define count_atomic_inc(x)	pg_atomic_fetch_add_u32((x), 1)
+#define count_atomic_dec(x)	pg_atomic_fetch_sub_u32((x), 1)
+#define MAX_NENTRIES	((uint32)PG_INT32_MAX)
+#endif
+
 /*
  * Per-freelist data.
  *
@@ -153,7 +166,7 @@ typedef HASHBUCKET *HASHSEGMENT;
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	Count		nentries;		/* number of entries in associated buckets */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -306,6 +319,54 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+/*
+ * Free list routines
+ */
+static inline void
+free_list_link_entry(HASHHDR *hctl, HASHBUCKET currBucket, int freelist_idx)
+{
+	FreeListData *list = &hctl->freeList[freelist_idx];
+
+	if (IS_PARTITIONED(hctl))
+	{
+		SpinLockAcquire(&list->mutex);
+		currBucket->link = list->freeList;
+		list->freeList = currBucket;
+		SpinLockRelease(&list->mutex);
+	}
+	else
+	{
+		currBucket->link = list->freeList;
+		list->freeList = currBucket;
+	}
+}
+
+static inline void
+free_list_increment_nentries(HASHHDR *hctl, int freelist_idx)
+{
+	FreeListData *list = &hctl->freeList[freelist_idx];
+
+	/* Check for overflow */
+	Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
+
+	if (IS_PARTITIONED(hctl))
+		count_atomic_inc(&list->nentries);
+	else
+		list->nentries.value++;
+}
+
+static inline void
+free_list_decrement_nentries(HASHHDR *hctl, int freelist_idx)
+{
+	FreeListData *list = &hctl->freeList[freelist_idx];
+
+	if (IS_PARTITIONED(hctl))
+		count_atomic_dec(&list->nentries);
+	else
+		list->nentries.value--;
+	/* Check for underflow (unsigned wraparound) */
+	Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
+}
 
 /************************** CREATE ROUTINES **********************/
 
@@ -1000,7 +1061,7 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		if (hctl->freeList[0].nentries.value > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1057,23 +1118,14 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
-				if (IS_PARTITIONED(hctl))
-					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
-
 				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
+				free_list_decrement_nentries(hctl, freelist_idx);
 
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
-				currBucket->link = hctl->freeList[freelist_idx].freeList;
-				hctl->freeList[freelist_idx].freeList = currBucket;
-
-				if (IS_PARTITIONED(hctl))
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+				free_list_link_entry(hctl, currBucket, freelist_idx);
 
 				/*
 				 * better hope the caller is synchronizing access to this
@@ -1115,6 +1167,7 @@ hash_search_with_hash_value(HTAB *hashp,
 							(errcode(ERRCODE_OUT_OF_MEMORY),
 							 errmsg("out of memory")));
 			}
+			free_list_increment_nentries(hctl, freelist_idx);
 
 			/* link into hashbucket chain */
 			*prevBucketPtr = currBucket;
@@ -1165,6 +1218,7 @@ hash_update_hash_key(HTAB *hashp,
 {
 	HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
 	HASHHDR    *hctl = hashp->hctl;
+	uint32		oldhashvalue;
 	uint32		newhashvalue;
 	Size		keysize;
 	uint32		bucket;
@@ -1218,6 +1272,7 @@ hash_update_hash_key(HTAB *hashp,
 			 hashp->tabname);
 
 	oldPrevPtr = prevBucketPtr;
+	oldhashvalue = existingElement->hashvalue;
 
 	/*
 	 * Now perform the equivalent of a HASH_ENTER operation to locate the hash
@@ -1271,12 +1326,21 @@ hash_update_hash_key(HTAB *hashp,
 	 */
 	if (bucket != newbucket)
 	{
+		int			old_freelist_idx = FREELIST_IDX(hctl, oldhashvalue);
+		int			new_freelist_idx = FREELIST_IDX(hctl, newhashvalue);
+
 		/* OK to remove record from old hash bucket's chain. */
 		*oldPrevPtr = currBucket->link;
 
 		/* link into new hashbucket chain */
 		*prevBucketPtr = currBucket;
 		currBucket->link = NULL;
+
+		if (old_freelist_idx != new_freelist_idx)
+		{
+			free_list_decrement_nentries(hctl, old_freelist_idx);
+			free_list_increment_nentries(hctl, new_freelist_idx);
+		}
 	}
 
 	/* copy new key into record */
@@ -1288,6 +1352,193 @@ hash_update_hash_key(HTAB *hashp,
 	return true;
 }
 
+/*
+ * hash_insert_with_hash_nocheck - inserts new entry into bucket without
+ * checking for duplicates.
+ *
+ * Caller should be sure there is no conflicting entry.
+ *
+ * Caller may pass a pointer to an old entry acquired with hash_delete_skip_freelist.
+ * In this case the entry is reused and returned as the new one.
+ */
+void *
+hash_insert_with_hash_nocheck(HTAB *hashp,
+							  const void *keyPtr,
+							  uint32 hashvalue,
+							  void *oldentry)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
+	uint32		bucket;
+	long		segment_num;
+	long		segment_ndx;
+	HASHSEGMENT segp;
+	HASHBUCKET	currBucket;
+	HASHBUCKET *prevBucketPtr;
+
+#if HASH_STATISTICS
+	hash_accesses++;
+	hctl->accesses++;
+#endif
+
+	/* disallow updates if frozen */
+	if (hashp->frozen)
+		elog(ERROR, "cannot update in frozen hashtable \"%s\"",
+			 hashp->tabname);
+
+	if (!IS_PARTITIONED(hctl) &&
+		hctl->freeList[0].nentries.value > (long) hctl->max_bucket &&
+		!has_seq_scans(hashp))
+		(void) expand_table(hashp);
+
+	/*
+	 * Compute the bucket the new entry belongs to using the caller-supplied
+	 * hash value, and locate the head of that bucket's chain, where the
+	 * new entry will be linked in.
+	 */
+	bucket = calc_bucket(hctl, hashvalue);
+
+	segment_num = bucket >> hashp->sshift;
+	segment_ndx = MOD(bucket, hashp->ssize);
+
+	segp = hashp->dir[segment_num];
+
+	if (segp == NULL)
+		hash_corrupted(hashp);
+
+	prevBucketPtr = &segp[segment_ndx];
+
+	if (oldentry != NULL)
+		currBucket = ELEMENT_FROM_KEY(oldentry);
+	else
+		currBucket = get_hash_entry(hashp, freelist_idx);
+
+	if (currBucket == NULL)
+	{
+		/* report a generic message */
+		if (hashp->isshared)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+	}
+	free_list_increment_nentries(hctl, freelist_idx);
+
+	/* copy key into record */
+	currBucket->hashvalue = hashvalue;
+	currBucket->link = *prevBucketPtr;
+	hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, hashp->keysize);
+
+	*prevBucketPtr = currBucket;
+
+	return (void *) ELEMENTKEY(currBucket);
+}
+
+/*
+ * hash_delete_skip_freelist - find and delete an entry, but don't put it
+ * on the free list.
+ *
+ * Used in the buffer manager to reuse the entry for an evicted buffer.
+ *
+ * The returned entry should either be reused with hash_insert_with_hash_nocheck
+ * or returned to the free list with hash_return_to_freelist.
+ */
+void *
+hash_delete_skip_freelist(HTAB *hashp,
+						  const void *keyPtr,
+						  uint32 hashvalue)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
+	Size		keysize;
+	uint32		bucket;
+	long		segment_num;
+	long		segment_ndx;
+	HASHSEGMENT segp;
+	HASHBUCKET	currBucket;
+	HASHBUCKET *prevBucketPtr;
+	HashCompareFunc match;
+
+#if HASH_STATISTICS
+	hash_accesses++;
+	hctl->accesses++;
+#endif
+
+	/*
+	 * Do the initial lookup
+	 */
+	bucket = calc_bucket(hctl, hashvalue);
+
+	segment_num = bucket >> hashp->sshift;
+	segment_ndx = MOD(bucket, hashp->ssize);
+
+	segp = hashp->dir[segment_num];
+
+	if (segp == NULL)
+		hash_corrupted(hashp);
+
+	prevBucketPtr = &segp[segment_ndx];
+	currBucket = *prevBucketPtr;
+
+	/*
+	 * Follow collision chain looking for matching key
+	 */
+	match = hashp->match;		/* save one fetch in inner loop */
+	keysize = hashp->keysize;	/* ditto */
+
+	while (currBucket != NULL)
+	{
+		if (currBucket->hashvalue == hashvalue &&
+			match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+			break;
+		prevBucketPtr = &(currBucket->link);
+		currBucket = *prevBucketPtr;
+#if HASH_STATISTICS
+		hash_collisions++;
+		hctl->collisions++;
+#endif
+	}
+
+	if (currBucket == NULL)
+		return NULL;
+
+	/* delete the record from the appropriate nentries counter. */
+	free_list_decrement_nentries(hctl, freelist_idx);
+
+	/* remove record from hash bucket's chain. */
+	*prevBucketPtr = currBucket->link;
+
+	return (void *) ELEMENTKEY(currBucket);
+}
+
+/*
+ * hash_return_to_freelist - return an entry deleted with
+ * hash_delete_skip_freelist to the free list.
+ *
+ * Used in the buffer manager in case a new conflicting entry was inserted
+ * by a concurrent process.
+ */
+void
+hash_return_to_freelist(HTAB *hashp,
+						const void *entry,
+						uint32 hashvalue)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
+	HASHBUCKET	currBucket;
+
+	if (entry == NULL)
+		return;
+
+	currBucket = ELEMENT_FROM_KEY(entry);
+
+	/* add the record to the appropriate freelist. */
+	free_list_link_entry(hctl, currBucket, freelist_idx);
+}
+
 /*
  * Allocate a new hashtable entry if possible; return NULL if out of memory.
  * (Or, if the underlying space allocator throws error for out-of-memory,
@@ -1349,11 +1600,6 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1611,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist */
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1627,7 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = hashp->hctl->freeList[0].nentries.value;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1637,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum += hashp->hctl->freeList[i].nentries.value;
 	}
 
 	return sum;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a8..4a1d6b37821 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,10 @@ extern Size BufTableShmemSize(int size);
 extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
-extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id,
+						   void *oldelem);
+extern void *BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableFreeDeleted(void *oldelem, uint32 hashcode);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index d7af0239c8c..1d586ef1169 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -150,4 +150,21 @@ extern Size hash_get_shared_size(HASHCTL *info, int flags);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
+/*
+ * Buffer Manager optimization utilities.
+ * They exist to avoid taking two partition locks simultaneously and to
+ * skip interaction with dynahash's freelist.
+ * Use them carefully and only if they give a meaningful improvement.
+ */
+extern void *hash_insert_with_hash_nocheck(HTAB *hashp,
+										   const void *keyPtr,
+										   uint32 hashvalue,
+										   void *oldentry);
+extern void *hash_delete_skip_freelist(HTAB *hashp,
+									   const void *keyPtr,
+									   uint32 hashvalue);
+extern void hash_return_to_freelist(HTAB *hashp,
+									const void *entry,
+									uint32 hashvalue);
+
 #endif							/* HSEARCH_H */
-- 
2.34.1

#5Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Yura Sokolov (#4)
Re: BufferAlloc: don't take two simultaneous locks

On 21 Dec 2021, at 10:23, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

<v1-0001-bufmgr-do-not-acquire-two-partition-lo.patch>

Hi Yura!

I've taken a look at the patch. The idea seems reasonable to me: clearing/evicting the old buffer and placing the new one seem to be different units of work, so there is no need to couple both partition locks together. And the claimed performance impact is fascinating! Though I haven't verified it yet.

At first glance the API change in BufTable does not seem obvious to me. Is void *oldelem actually a BufferTag *, or maybe a BufferLookupEnt *? What if we would like to use or manipulate oldelem in the future?

And the name BufTableFreeDeleted() confuses me a bit. You know, in C we usually free(), but in C++ we delete [], and here we do both... Just to be sure.

Thanks!

Best regards, Andrey Borodin.

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrey Borodin (#5)
Re: BufferAlloc: don't take two simultaneous locks

At Sat, 22 Jan 2022 12:56:14 +0500, Andrey Borodin <x4mmm@yandex-team.ru> wrote in

I've taken a look at the patch. The idea seems reasonable to me:
clearing/evicting the old buffer and placing the new one seem to be
different units of work, so there is no need to couple both partition
locks together. And the claimed performance impact is fascinating!
Though I haven't verified it yet.

The need for holding both locks came, it seems to me, from the fact
that the function was moving a buffer between two pages, and that there
is a moment where the buftable holds two entries for one buffer. It
seems to me this patch moves the victim buffer to the new page via an
"unallocated" state, avoiding duplicate buftable entries for the same
buffer. The outline of the story sounds reasonable.
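
For readers following the diff, the reordering condenses to roughly the
following (a paraphrase of the v1 diff upthread, with the error paths and
the same-partition case omitted); note the victim stays pinned the whole
time, which is what keeps other backends from evicting it while it has no
mapping entry:

    /* at most one mapping partition lock is held at any moment */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    /* ... clear the buffer's tag and flags under its header lock ... */
    oldElem = BufTableDelete(&oldTag, oldHash); /* buffer is "unallocated" */
    LWLockRelease(oldPartitionLock);

    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    if ((buf_id = BufTableLookup(&newTag, newHash)) >= 0)
    {
        /* collision: someone else already loaded the block */
        UnpinBuffer(buf, true);
        StrategyFreeBuffer(buf);            /* give the victim back */
        BufTableFreeDeleted(oldElem, oldHash);
    }
    else
        BufTableInsert(&newTag, newHash, buf->buf_id, oldElem);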

At first glance the API change in BufTable does not seem obvious to
me. Is void *oldelem actually a BufferTag *, or maybe a
BufferLookupEnt *? What if we would like to use or manipulate
oldelem in the future?

And the name BufTableFreeDeleted() confuses me a bit. You know, in C
we usually free(), but in C++ we delete [], and here we do
both... Just to be sure.

Honestly, I don't like the API change at all, as it allows a
dynahash to be in a (even if only tentatively) broken state, and bufmgr
touches too many dynahash internals. Couldn't we get a good extent of
the benefit without such invasive changes?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Yura.

The test results look promising. But the naming and the dynahash
API change seem a little confusing.

1) I think it is better to split the main part and atomic nentries
optimization into separate commits.
2) Also, it would be nice to fix the hash_update_hash_key bug :)
3) Do we really need a SIZEOF_LONG check? I think pg_atomic_uint64 is
fine these days.
4) It looks like hash_insert_with_hash_nocheck could potentially break
the hash table. Would it be better to replace it with
hash_search_with_hash_value and a HASH_ATTACH action? (See the sketch
after this list.)
5) In that case, hash_delete_skip_freelist becomes
hash_search_with_hash_value with HASH_DETACH.
6) And then hash_return_to_freelist -> hash_dispose_detached_entry?
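
A sketch of what the proposed actions could look like in hsearch.h
(HASH_ATTACH and HASH_DETACH are hypothetical names suggested above, not
existing API; the first four values are the actions that exist today):

    typedef enum
    {
        HASH_FIND,
        HASH_ENTER,
        HASH_REMOVE,
        HASH_ENTER_NULL,
        HASH_DETACH,  /* hypothetical: unlink entry, keep off the freelist */
        HASH_ATTACH   /* hypothetical: insert a previously detached entry */
    } HASHACTION;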

Another approach is a new version of hash_update_hash_key with
callbacks. That is probably the most "correct" way to keep the hash
table's implementation details closed. It should be doable, I think.

Thanks,
Michail.

#8Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Michail Nikolaev (#7)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Yura.

One additional point:

1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
1333: CLEAR_BUFFERTAG(buf->tag);
1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
1335: UnlockBufHdr(buf, buf_state);

I think there is no point in unlocking the buffer here, because it will
be locked again a few moments later (and no one can find it anyway). Of
course, it should be unlocked in the collision case.

BTW, I still think it is better to introduce some kind of
hash_update_hash_key and use it.

It may look like this:

// should be called with oldPartitionLock acquired
// newPartitionLock is held on return
// oldPartitionLock and newPartitionLock are never taken at the same time
// if an entry for newKeyPtr already exists, existingEntry is removed instead
bool hash_update_hash_key_or_remove(
HTAB *hashp,
void *existingEntry,
const void *newKeyPtr,
uint32 newHashValue,
LWLock *oldPartitionLock,
LWLock *newPartitionLock
);

Thanks,
Michail.

#9Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Michail Nikolaev (#8)
Re: BufferAlloc: don't take two simultaneous locks

On Sun, 06/02/2022 at 19:34 +0300, Michail Nikolaev wrote:

Hello, Yura.

One additional point:

1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
1333: CLEAR_BUFFERTAG(buf->tag);
1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
1335: UnlockBufHdr(buf, buf_state);

I think there is no point in unlocking the buffer here, because it will
be locked again a few moments later (and no one can find it anyway). Of
course, it should be unlocked in the collision case.

UnlockBufHdr actually writes buf_state. Until it is called, the buffer
is in an intermediate state and it is ... locked.

We have to write the state with BM_TAG_VALID cleared before we
call BufTableDelete and release the oldPartitionLock, to maintain
consistency.

Perhaps it could be cheated, and there would be no harm in skipping the
state write at this point. But I'm not confident enough to do that.
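
For context, the macro is essentially a barrier plus a plain atomic store
(this is the gist of the definition in buf_internals.h as far as I can
tell; check the tree for the exact form), so skipping it really would
leave the header locked and the cleared tag unpublished:

    #define UnlockBufHdr(desc, s) \
        do { \
            pg_write_barrier(); \
            pg_atomic_write_u32(&(desc)->state, (s) & (~BM_LOCKED)); \
        } while (0)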

BTW, I still think it is better to introduce some kind of
hash_update_hash_key and use it.

It may look like this:

// should be called with oldPartitionLock acquired
// newPartitionLock is held on return
// oldPartitionLock and newPartitionLock are never taken at the same time
// if an entry for newKeyPtr already exists, existingEntry is removed instead
bool hash_update_hash_key_or_remove(
HTAB *hashp,
void *existingEntry,
const void *newKeyPtr,
uint32 newHashValue,
LWLock *oldPartitionLock,
LWLock *newPartitionLock
);

Interesting suggestion, thanks. I'll think about it.
It has the downside of bringing LWLock knowledge into dynahash.c,
but otherwise it looks smart.

---------

regards,
Yura Sokolov

#10Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Michail Nikolaev (#8)
1 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

Hello, all.

I thought about simplifying the patch, and tested a version
with no BufTable or dynahash API changes at all.

It performs surprisingly well. It is just a bit worse
than v1, since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with the
next message. The patch is attached here.

------

regards,
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v2-0001-bufmgr-do-not-acquire-two-partition-lo.patchtext/x-patch; charset=UTF-8; name=v2-0001-bufmgr-do-not-acquire-two-partition-lo.patchDownload
From 7f430bdaa748456ed6b59f16f32ac0ea55644a66 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Fri, 14 Jan 2022 02:28:36 +0300
Subject: [PATCH v2] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 179 +++++++++++++---------------
 1 file changed, 82 insertions(+), 97 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..abb916938a7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1288,93 +1288,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1391,31 +1314,100 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
+	 * IO_IN_PROGRESS (flags that were not cleared in the previous code).
+	 */
+	Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. Note that we have not yet
+	 * removed the hashtable entry for the old tag.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.  First, give up the buffer we were planning to use
+		 * and put it on the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Okay, it's finally safe to rename the buffer.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1423,13 +1415,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#10)
Re: BufferAlloc: don't take two simultaneous locks

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about simplifying the patch, and tested a version
with no BufTable or dynahash API changes at all.

It performs surprisingly well. It is just a bit worse
than v1, since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with the
next message. The patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

(existing comments)

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

...

* Somebody could have pinned or re-dirtied the buffer while we were
* doing the I/O and making the new hashtable entry. If so, we can't
* recycle this buffer; we must undo everything we've done and start
* over with a new victim buffer.

We no longer take a lock on the new partition, and we have no new hash
entry at this point (if others have not yet created one).

+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We

The reason we can clear out the tag is that it's safe to use the victim
buffer at this point. This comment needs to mention that reason.

+	 *
+	 * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
+	 * IO_IN_PROGRESS (flags that were not cleared in the previous code).
+	 */
+	Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);

It seems to be a test for potential bugs in other functions. As
the comment says, we are sure that no other processes are pinning
the buffer, and the existing code doesn't seem to care about that
condition. Is it really needed?

+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another

The most significant point of this patch is how the victim buffer is
protected from being stolen until it is set up with its new tag. I
think we need an explanation of that protection here.

+	 * buffer for the same block we want to read in. Note that we have not yet
+	 * removed the hashtable entry for the old tag.

Since we have removed the hash table entry for the old tag at this
point, the comment is now wrong.

+		 * the first place.  First, give up the buffer we were planning to use
+		 * and put it on the free list.
..
+		StrategyFreeBuffer(buf);

This is one downside of this patch. But it seems to me that the odds
are low that many buffers are freed in a short time by this logic. By
the way, it would be better if the sentence starting with "First" had a
separate comment section.

(existing comment)
| * Okay, it's finally safe to rename the buffer.

We don't "rename" the buffer here. And the safety is already
established at the end of the oldPartitionLock section. So it would
be just something like "Now allocate the victim buffer for the new
tag"?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#11)
5 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about simplifying the patch, and tested a version
with no BufTable or dynahash API changes at all.

It performs surprisingly well. It is just a bit worse
than v1, since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with the
next message. The patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

Thank you for the review and remarks.

(existing comments)

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

...

* Somebody could have pinned or re-dirtied the buffer while we were
* doing the I/O and making the new hashtable entry. If so, we can't
* recycle this buffer; we must undo everything we've done and start
* over with a new victim buffer.

We no longer take a lock on the new partition, and we have no new hash
entry at this point (if others have not yet created one).

fixed

+        * Clear out the buffer's tag and flags.  We must do this to ensure that
+        * linear scans of the buffer array don't think the buffer is valid. We

The reason we can clear out the tag is that it's safe to use the victim
buffer at this point. This comment needs to mention that reason.

Tried to describe.

+        *
+        * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
+        * IO_IN_PROGRESS (flags that were not cleared in the previous code).
+        */
+       Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);

It seems to be a test for potential bugs in other functions. As
the comment says, we are sure that no other processes are pinning
the buffer, and the existing code doesn't seem to care about that
condition. Is it really needed?

OK, I agree this check is excessive.
These two flags were not cleared in the previous code, and I didn't get
why. Probably it is just a historical accident.

+       /*
+        * Try to make a hashtable entry for the buffer under its new tag. This
+        * could fail because while we were writing someone else allocated another

The most significant point of this patch is how the victim buffer is
protected from being stolen until it is set up with its new tag. I
think we need an explanation of that protection here.

I don't clearly get what you mean :(. I would appreciate your
suggestion for this comment.

+        * buffer for the same block we want to read in. Note that we have not yet
+        * removed the hashtable entry for the old tag.

Since we have removed the hash table entry for the old tag at this
point, the comment is now wrong.

Thanks, changed.

+                * the first place.  First, give up the buffer we were planning to use
+                * and put it to free lists.
..
+               StrategyFreeBuffer(buf);

This is one downside of this patch. But it seems to me that the odds
are low that many buffers are freed in a short time by this logic. By
the way, it would be better if the sentence starting with "First" had a
separate comment section.

Split the comment.

(existing comment)
| * Okay, it's finally safe to rename the buffer.

We don't "rename" the buffer here. And the safety is already
established at the end of the oldPartitionLock section. So it would
be just something like "Now allocate the victim buffer for the new
tag"?

Changed to "Now it is safe to use victim buffer for new tag."

There is also a tiny code change at buffer reuse finalization: instead
of a LockBufHdr+UnlockBufHdr pair I use a single atomic_fetch_or,
protected by WaitBufHdrUnlocked. I've tried to explain its safety.
Please check it.
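
Schematically the finalization now looks like this (my paraphrase of the
v3 change; the attached patch is authoritative, and new_bits stands for
the flag combination computed just before it):

    buf->tag = newTag;          /* safe: nobody has a reason to lock the
                                 * header of an untagged, singly-pinned
                                 * buffer */
    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
    if (buf_state & BM_LOCKED)
        WaitBufHdrUnlocked(buf);    /* rare: someone held the header lock */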

Benchmarks:
- base point is 6ce16088bfed97f9.
- notebook with i7-1165G7 and server with Xeon 8354H (1&2 sockets)
- pgbench select only scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB.
- enabled hugepages
- second best result from five runs

Notebook:
conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 29508 | 29481 | 31774 | 32305
2 | 57139 | 56694 | 63393 | 62968
3 | 89759 | 90861 | 101873 | 102399
5 | 133491 | 134573 | 145451 | 145009
7 | 156739 | 155832 | 164633 | 164562
17 | 216863 | 216379 | 251923 | 251017
27 | 209532 | 209802 | 244202 | 243709
53 | 212615 | 213552 | 248107 | 250317
83 | 214446 | 218230 | 252414 | 252337
107 | 211276 | 217109 | 252762 | 250328
139 | 208070 | 214265 | 248350 | 249684
163 | 206764 | 214594 | 247369 | 250323
191 | 205478 | 213511 | 244597 | 246877
211 | 200976 | 212976 | 244035 | 245032
239 | 196588 | 211519 | 243897 | 245055
271 | 195813 | 209631 | 237457 | 242771
307 | 192724 | 208074 | 237658 | 241759
353 | 187847 | 207189 | 234548 | 239008
397 | 186942 | 205317 | 230465 | 238782

I don't get why the numbers changed since the first letter ))
But still there is no slowdown, and there is a measurable gain at 128MB
shared buffers.

Xeon 1 socket

conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 41975 | 41799 | 52898 | 52715
2 | 77693 | 77531 | 97571 | 98547
3 | 114713 | 114533 | 142709 | 143579
5 | 188898 | 187241 | 239322 | 236682
7 | 261516 | 260249 | 329119 | 328562
17 | 521821 | 518981 | 672390 | 662987
27 | 555487 | 557019 | 674630 | 675703
53 | 868213 | 897097 | 1190734 | 1202575
83 | 868232 | 881705 | 1164997 | 1157764
107 | 850477 | 855169 | 1140597 | 1128027
139 | 816311 | 826756 | 1101471 | 1096197
163 | 794788 | 805946 | 1078445 | 1071535
191 | 765934 | 783209 | 1059497 | 1039936
211 | 738656 | 786171 | 1083356 | 1049025
239 | 713124 | 837040 | 1104629 | 1125969
271 | 692138 | 847741 | 1094432 | 1131968
307 | 682919 | 847939 | 1086306 | 1124649
353 | 679449 | 844596 | 1071482 | 1125980
397 | 676217 | 833009 | 1058937 | 1113496

There is a small slowdown at some connection counts (17, 107-191).
It is reproducible. Probably it is due to one more atomic write, or
perhaps to some other scheduling issue (processes block less on the
buffer manager but compete more on other resources). I could not
reliably determine why, because the change is too small, and
`perf record` hurts performance even more at this point.

This is the reason I changed the finalization to an atomic_or instead
of a Lock+Unlock pair. The change helped a bit, but didn't remove the
slowdown completely.

Xeon 2 socket

conns | m0 | patch_v3 | m0 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 44317 | 43747 | 53920 | 53759
2 | 81193 | 79976 | 99138 | 99213
3 | 120755 | 114481 | 148102 | 146494
5 | 190007 | 187384 | 232078 | 229627
7 | 258602 | 256657 | 325545 | 322417
17 | 551814 | 549041 | 692312 | 688204
27 | 787353 | 787916 | 1023509 | 1020995
53 | 973880 | 996019 | 1228274 | 1246128
83 | 1108442 | 1258589 | 1596292 | 1662586
107 | 1072188 | 1317229 | 1542401 | 1684603
139 | 1000446 | 1272759 | 1490757 | 1672507
163 | 967378 | 1224513 | 1461468 | 1660012
191 | 926010 | 1178067 | 1435317 | 1645886
211 | 909919 | 1148862 | 1417437 | 1629487
239 | 895944 | 1108579 | 1393530 | 1616824
271 | 880545 | 1078280 | 1374878 | 1608412
307 | 865560 | 1056988 | 1355164 | 1601066
353 | 857591 | 1033980 | 1330069 | 1586769
397 | 840374 | 1016690 | 1312257 | 1573376

regards,
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v3-0001-PGPRO-5616-bufmgr-do-not-acquire-two-partition-lo.patchtext/x-patch; charset=UTF-8; name=v3-0001-PGPRO-5616-bufmgr-do-not-acquire-two-partition-lo.patchDownload
From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

Tags: lwlock_numa
---
 src/backend/storage/buffer/bufmgr.c | 206 ++++++++++++++--------------
 1 file changed, 105 insertions(+), 101 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..bb8b1cd2f4b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1114,6 +1114,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *buf;
 	bool		valid;
 	uint32		buf_state;
+	uint32		new_bits;
 
 	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1288,93 +1289,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1382,52 +1306,132 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We have the buffer pinned, and we are its only pinner at the moment.
+	 * We hold the buffer header lock, and the exclusive partition lock if
+	 * the tag is valid. Given all that, it is safe to clear the tag, since
+	 * no other process can inspect it at the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete the old tag from the hash table if it was valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will
+	 * have to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it
+		 * back on the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use the victim buffer for the new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use a direct atomic OR instead of a Lock+Unlock pair since no other
+	 * backend could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers and Drop*Buffers scan all buffers and lock them to
+	 * compare tags, and UnlockBufHdr does a raw write to the state. So we
+	 * have to spin if we find the buffer locked.
+	 *
+	 * Note that we write the tag unlocked. This is also safe, since there is
+	 * always a check for BM_VALID when the tag is compared.
 	 */
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
 
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))
 	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
+		WaitBufHdrUnlocked(buf);
+		buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
 	}
 
 	LWLockRelease(newPartitionLock);
-- 
2.35.1

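For quick reference, the new sequence in condensed form (a sketch built
from the hunks above, with the same identifiers; pinning, buffer-header
locking and I/O handling omitted):

    /* evict the old page while holding only the old partition lock */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    CLEAR_BUFFERTAG(buf->tag);
    BufTableDelete(&oldTag, oldHash);
    LWLockRelease(oldPartitionLock);    /* drop it before taking the new one */

    /* then insert the new tag under the new partition lock */
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    if (BufTableInsert(&newTag, newHash, buf->buf_id) >= 0)
    {
        /*
         * Somebody else inserted the same tag first: reuse their buffer
         * and return ours to the free list.
         */
        UnpinBuffer(buf, true);
        StrategyFreeBuffer(buf);
    }
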
v3-2socket.gif (image/gif)
v3-1socket.gif (image/gif)
v3-notebook.gif (image/gif)
res.zip (application/zip)
�
b
Hb�
bux��PK�XUT �ABres/p75/UT
�Gb�Gb�Gbux��PK%EUT�=���f ���res/p75/v3-2socket.csvUT
�%b�%b�Gbux��PK$EUTo��7�� ���res/p75/v3-1socket.csvUT
�%b�%b�Gbux��PK$EUT�e�� ��@res/p75/v3-notebook.csvUT
�%b�%b�Gbux��PK�XUT �A�res/tbl/UT
�Gb�Gb�Gbux��PK$EUT	��� ���res/tbl/v3-1socket.tblUT
�%b�Db�Gbux��PK%EUTUtK�� ���	res/tbl/v3-2socket.tblUT
�%b�Db�Gbux��PK$EUT�w:�� ���res/tbl/v3-notebook.tblUT
�%b�Db�Gbux��PK�XUT �A�
res/csv/UT
�Gb�Gb�Gbux��PK$EUT��(^� ��res/csv/v3-1socket.csvUT
�%b�^
b�Gbux��PK%EUT����h� ���res/csv/v3-2socket.csvUT
�%b�^
b�Gbux��PK$EUT6��{Lt ���res/csv/v3-notebook.csvUT
�%b�^
b�Gbux��PK�XUT �AZres/gif/UT
�GbHb�Gbux��PK%EUT)��f7�8 ���res/gif/v3-1socket.gifUT
�%bHb�Gbux��PK%EUT4���57 ��jKres/gif/v3-2socket.gifUT
�%bHb�Gbux��PK$EUTfw~��/�0 ����res/gif/v3-notebook.gifUT
�%bHb�Gbux��PK^�
#13Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Yura Sokolov (#12)
Re: BufferAlloc: don't take two simultaneous locks

On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about patch simplification, and tested version
without BufTable and dynahash api change at all.

It performs surprisingly well. It is just a bit worse
than v1 since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with
next message. Patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is not now the case.

Dynahash actions are still possible. You now have the BufTableDelete
before the BufTableInsert, which opens up the possibility I discussed
here:
/messages/by-id/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for new BufTable API
BufTableReuse() - similar to BufTableDelete() but does NOT put entry
back on freelist, we remember it in a private single item cache in
dynahash
BufTableAssign() - similar to BufTableInsert() but can only be
executed directly after BufTableReuse(), fails with ERROR otherwise.
Takes the entry from single item cache and re-assigns it to new tag

In dynahash we have two new modes that match the above
HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
places entry on the single item cache, avoiding freelist
HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
uses the entry from the single item cache, rather than asking freelist
This last call can fail if someone else already inserted the tag, in
which case it adds the single item cache entry back onto freelist

Notice that single item cache is not in shared memory, so on abort we
should give it back, so we probably need an extra API call for that
also to avoid leaking an entry.

Doing it this way allows us to
* avoid touching freelists altogether in the common path - we know we
are about to reassign the entry, so we do remember it - no contention
from other backends, no borrowing, etc.
* avoid sharing the private details outside of the dynahash module
* allows us to use the same technique elsewhere that we have
partitioned hash tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.
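
In other words, the eviction path in BufferAlloc would go roughly like
this (a sketch only: BufTableReuse() and BufTableAssign() are the
proposed functions, none of this exists yet):

/* Drop the old mapping, but keep the dynahash entry in the
 * backend-private single item cache instead of the shared freelist. */
BufTableReuse(&oldTag, oldHash);		/* HASH_REUSE underneath */
LWLockRelease(oldPartitionLock);

LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);

/* Re-assign the cached entry to the new tag (HASH_ASSIGN).  On a
 * collision the cached entry goes back onto the freelist and the
 * existing buffer id is returned, as with BufTableInsert() today. */
buf_id = BufTableAssign(&newTag, newHash, buf->buf_id);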

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
/messages/by-id/CA+TgmoZkg-04rcNRURt=jAG0Cs5oPyB-qKxH4wqX09e-oXy-nw@mail.gmail.com
since the freelists will be much less contended with the above approach

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs http://www.EnterpriseDB.com/

#14Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#12)
Re: BufferAlloc: don't take two simultaneous locks

Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:

From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.

+	 * We have pinned buffer and we are single pinner at the moment so there
+	 * is no other pinners.

Seems redundant.

We hold buffer header lock and exclusive partition
+	 * lock if tag is valid. Given these statements it is safe to clear tag
+	 * since no other process can inspect it to the moment.
+	 */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
locked the buffer header at this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (of which
there shouldn't be any, but here there are)...

And OR'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Why don't you just use LockBufHdr/UnlockBufHdr?

Greetings,

Andres Freund

#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#14)
Re: BufferAlloc: don't take two simultaneous locks

At Fri, 25 Feb 2022 00:04:55 -0800, Andres Freund <andres@anarazel.de> wrote in

Why don't you just use LockBufHdr/UnlockBufHdr?

FWIW, v2 looked fine to me with regard to this point.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Simon Riggs (#13)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Simon.

On Fri, 25/02/2022 at 04:35 +0000, Simon Riggs wrote:

On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about patch simplification, and tested version
without BufTable and dynahash api change at all.

It performs surprisingly well. It is just a bit worse
than v1 since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with
next message. Patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is not now the case.

Dynahash actions are still possible. You now have the BufTableDelete
before the BufTableInsert, which opens up the possibility I discussed
here:
/messages/by-id/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for new BufTable API
BufTableReuse() - similar to BufTableDelete() but does NOT put entry
back on freelist, we remember it in a private single item cache in
dynahash
BufTableAssign() - similar to BufTableInsert() but can only be
executed directly after BufTableReuse(), fails with ERROR otherwise.
Takes the entry from single item cache and re-assigns it to new tag

In dynahash we have two new modes that match the above
HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
places entry on the single item cache, avoiding freelist
HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
uses the entry from the single item cache, rather than asking freelist
This last call can fail if someone else already inserted the tag, in
which case it adds the single item cache entry back onto freelist

Notice that single item cache is not in shared memory, so on abort we
should give it back, so we probably need an extra API call for that
also to avoid leaking an entry.

Why is there a need for this? How could a backend be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

Doing it this way allows us to
* avoid touching freelists altogether in the common path - we know we
are about to reassign the entry, so we do remember it - no contention
from other backends, no borrowing, etc.
* avoid sharing the private details outside of the dynahash module
* allows us to use the same technique elsewhere that we have
partitioned hash tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

Thank you for the suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it into the v4 patch? Perhaps it could be a separate
commit in the new version.

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
/messages/by-id/CA+TgmoZkg-04rcNRURt=jAG0Cs5oPyB-qKxH4wqX09e-oXy-nw@mail.gmail.com
since the freelists will be much less contended with the above approach

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs http://www.EnterpriseDB.com/

------

regards,
Yura Sokolov

#17Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Yura Sokolov (#16)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 25 Feb 2022 at 09:24, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

Thank you for the suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it into the v4 patch? Perhaps it could be a separate
commit in the new version.

I don't insist that you do that, but since the API changes are a few
hours' work, ISTM it is better to include them in one patch for combined perf
testing. It would be better to put all the changes in this area into PG15
than to split them across multiple releases.

Why is there a need for this? How could a backend be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

Sounds reasonable.

--
Simon Riggs http://www.EnterpriseDB.com/

#18Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#14)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Andres

On Fri, 25/02/2022 at 00:04 -0800, Andres Freund wrote:

Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:

From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.

+	 * We have pinned buffer and we are single pinner at the moment so there
+	 * is no other pinners.

Seems redundant.

We hold buffer header lock and exclusive partition
+	 * lock if tag is valid. Given these statements it is safe to clear tag
+	 * since no other process can inspect it to the moment.
+	 */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

Logically, there are no backends that could be interested in the buffer.
Physically, they do LockBufHdr/UnlockBufHdr just to check that the buffer is not one they are interested in.

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
locked the buffer header at this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (of which
there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was locked just
after the first optimistic atomic_fetch_or. 99.999% of the time this loop has
no work to do. But if another backend did lock buf->state, the loop waits
until it releases the lock and retries the atomic_fetch_or.
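
Schematically, the described behaviour is (a sketch, not the exact v3 code):

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
while (unlikely(buf_state & BM_LOCKED))
{
	/*
	 * The holder's UnlockBufHdr writes back its own snapshot of the
	 * state and may erase our bits, so wait for the release and OR
	 * them in again.  (Real code would use a spin delay here.)
	 */
	while (pg_atomic_read_u32(&buf->state) & BM_LOCKED)
		;
	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
}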

And OR'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Sorry, I didn't understand that sentence :((

Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if the optimistic case succeeds).

But I thought to use Lock+UnlockBufHdr instead of the safety loop:

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
if (unlikely(buf_state & BM_LOCKED))
{
	buf_state = LockBufHdr(buf);
	UnlockBufHdr(buf, buf_state | new_bits);
}

I agree the code is cleaner this way. Will do in the next version.

-----

regards,
Yura Sokolov

#19Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#18)
Re: BufferAlloc: don't take two simultaneous locks

Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

Logically, there are no backends that could be interested in the buffer.
Physically, they do LockBufHdr/UnlockBufHdr just to check that the buffer is not one they are interested in.

Yea, but that's still being interested in the buffer...

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
locked the buffer header at this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (of which
there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was locked just
after the first optimistic atomic_fetch_or. 99.999% of the time this loop has
no work to do. But if another backend did lock buf->state, the loop waits
until it releases the lock and retries the atomic_fetch_or.

And OR'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Sorry, I didn't understand that sentence :((

You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.
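
To spell this out, here is a minimal standalone model of the two
operations (simplified stand-ins using C11 atomics, not the actual
bufmgr code):

#include <stdatomic.h>
#include <stdint.h>

#define BM_LOCKED (1u << 31)

/* Acquire: OR'ing in only BM_LOCKED is safe -- it either sets the bit
 * (lock acquired) or changes nothing (someone else already holds it). */
static uint32_t
lock_hdr(_Atomic uint32_t *state)
{
	uint32_t	old;

	while ((old = atomic_fetch_or(state, BM_LOCKED)) & BM_LOCKED)
		;						/* spin */
	return old | BM_LOCKED;
}

/* Release: a plain store of the holder's snapshot.  Any bits another
 * backend OR'ed in after lock_hdr() returned are silently overwritten,
 * which is why OR'ing arbitrary bits into a possibly-locked header is
 * unsafe. */
static void
unlock_hdr(_Atomic uint32_t *state, uint32_t snapshot)
{
	atomic_store(state, snapshot & ~BM_LOCKED);
}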

Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if the optimistic case succeeds).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.
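
For reference, the macro in src/include/storage/buf_internals.h is
essentially:

#define UnlockBufHdr(desc, s)	\
	do {	\
		pg_write_barrier(); \
		pg_atomic_write_u32(&(desc)->state, (s) & (~BM_LOCKED)); \
	} while (0)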

Greetings,

Andres Freund

#20Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#19)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 25/02/2022 at 09:01 -0800, Andres Freund wrote:

Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

Logically, there are no backends that could be interested in the buffer.
Physically, they do LockBufHdr/UnlockBufHdr just to check that the buffer is not one they are interested in.

Yea, but that's still being interested in the buffer...

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
locked the buffer header at this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (of which
there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was locked just
after the first optimistic atomic_fetch_or. 99.999% of the time this loop has
no work to do. But if another backend did lock buf->state, the loop waits
until it releases the lock and retries the atomic_fetch_or.

And OR'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Sorry, I didn't understand that sentence :((

You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.

First, both states are valid: before the atomic_or and after it.
Second, there are no checks of buffer->state while the buffer header is locked.
All LockBufHdr users use the result of LockBufHdr. (I just checked that.)

Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if the optimistic case succeeds).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.

A write barrier is not free on any platform.

Well, while I don't see a problem with modifying buffer->state, there is a problem
with modifying buffer->tag: I missed that Drop*Buffers doesn't check the
BM_TAG_VALID flag. Therefore I either have to add this check to those places,
or return to the LockBufHdr+UnlockBufHdr pair.
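
For example, the buffer scan in DropRelFileNodeBuffers does roughly this
(schematic; the real loop has more pre-checks):

for (i = 0; i < NBuffers; i++)
{
	BufferDesc *bufHdr = GetBufferDescriptor(i);
	uint32		buf_state;

	buf_state = LockBufHdr(bufHdr);
	/* the tag is compared without first checking BM_TAG_VALID */
	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
		InvalidateBuffer(bufHdr);	/* releases the header lock */
	else
		UnlockBufHdr(bufHdr, buf_state);
}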

For patch simplicity I'll return to the Lock+UnlockBufHdr pair. But it has a
measurable impact at low connection counts on many-socket machines.


Greetings,

Andres Freund

#21Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Simon Riggs (#17)
5 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 25/02/2022 at 09:38 +0000, Simon Riggs wrote:

On Fri, 25 Feb 2022 at 09:24, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

Thank you for the suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it into the v4 patch? Perhaps it could be a separate
commit in the new version.

I don't insist that you do that, but since the API changes are a few
hours' work, ISTM it is better to include them in one patch for combined perf
testing. It would be better to put all the changes in this area into PG15
than to split them across multiple releases.

Why is there a need for this? How could a backend be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

Sounds reasonable.

Ok, here is v4.
It comes as two commits: one for the BufferAlloc locking change and another
for avoiding dynahash's freelist.

The buffer locking patch is the same as v2 with some comment changes, i.e. it
uses Lock+UnlockBufHdr.

For dynahash, HASH_REUSE and HASH_ASSIGN are added as suggested.
HASH_REUSE stores the deleted element in a per-process static variable.
HASH_ASSIGN uses this element instead of the freelist. If there's no
such stored element, it falls back to HASH_ENTER.
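
The mechanism in miniature (a simplified model, not the actual dynahash
code):

typedef struct Elem { struct Elem *link; } Elem;

static Elem *freelist;			/* stands in for the locked shared freelist */
static Elem *reuse_cache;		/* backend-local, needs no lock */

static void
elem_reuse(Elem *deleted)		/* HASH_REUSE: remember instead of freeing */
{
	deleted->link = NULL;
	reuse_cache = deleted;
}

static Elem *
elem_assign(void)				/* HASH_ASSIGN: prefer the cached element */
{
	Elem	   *e = reuse_cache;

	if (e != NULL)
		reuse_cache = NULL;		/* common case: freelist never touched */
	else
	{
		/* HASH_ENTER fallback; the real code locks the freelist here */
		e = freelist;
		if (e != NULL)
			freelist = e->link;
	}
	return e;
}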

I've implemented Robert Haas's suggestion to count elements in freelists
instead of nentries:

One idea is to jigger things so that we maintain a count of the total
number of entries that doesn't change except when we allocate, and
then for each freelist partition we maintain the number of entries in
that freelist partition. So then the size of the hash table, instead
of being sum(nentries) is totalsize - sum(nfree).

/messages/by-id/CA+TgmoZkg-04rcNRURt=jAG0Cs5oPyB-qKxH4wqX09e-oXy-nw@mail.gmail.com

It helps to avoid taking the freelist lock just to update the counters.
I did it by replacing "nentries" with "nfree" and adding
"nalloced" to each freelist. It also makes "hash_update_hash_key" valid
for a key that migrates between partitions.
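
With that, the number of live entries can be computed without taking any
freelist lock; schematically (the field names follow the description
above, the actual layout may differ):

#include "port/atomics.h"

typedef struct FreeListSketch
{
	pg_atomic_uint32 nalloced;	/* entries ever handed to this freelist */
	pg_atomic_uint32 nfree;		/* entries currently sitting on it */
} FreeListSketch;

static long
num_entries_sketch(FreeListSketch *lists, int nlists)
{
	long		total = 0;
	int			i;

	for (i = 0; i < nlists; i++)
		total += (long) pg_atomic_read_u32(&lists[i].nalloced) -
				 (long) pg_atomic_read_u32(&lists[i].nfree);
	return total;
}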

I believe there is no need for "nalloced" in each freelist; instead, a
single such field should be in HASHHDR. Moreover, it seems to me the
`element_alloc` function doesn't need to acquire the freelist partition lock,
since it is called only during initialization of a shared hash table.
Am I right?

I didn't go down this path in v4 for simplicity, but I can put it into v5
if approved.

To be honest, the "reuse" patch gives little improvement, but it is still
measurable at some connection counts.

I tried reducing the freelist partitions to 8, but it had a mixed impact.
Most of the time performance is the same, but sometimes it is a bit lower. I
didn't investigate the reasons. Perhaps they are not related to the buffer
manager.

I didn't introduce the new functions BufTableReuse and BufTableAssign,
since there is a single call to BufTableInsert and two calls to
BufTableDelete. So I reused these functions and just added a "reuse" flag
to BufTableDelete.
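
So the call sites look roughly like this (a sketch of the flag; the
exact v4 signature may differ):

/* eviction path: keep the entry for the upcoming insert (HASH_REUSE) */
BufTableDelete(&oldTag, oldHash, true);

/* invalidation path: the entry really goes back to the freelist */
BufTableDelete(&oldTag, oldHash, false);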

Tests: simple_select on a Xeon 8354H, with 128MB and 1GB shared buffers,
at scale 100.

1 socket:
conns | master | patch_v4 | master 1G | patch_v4 1G
--------+------------+------------+------------+------------
1 | 41975 | 41540 | 52898 | 52213
2 | 77693 | 77908 | 97571 | 98371
3 | 114713 | 115522 | 142709 | 145226
5 | 188898 | 187617 | 239322 | 237269
7 | 261516 | 260006 | 329119 | 329449
17 | 521821 | 519473 | 672390 | 662106
27 | 555487 | 555697 | 674630 | 672736
53 | 868213 | 896539 | 1190734 | 1202505
83 | 868232 | 866029 | 1164997 | 1158719
107 | 850477 | 845685 | 1140597 | 1134502
139 | 816311 | 816808 | 1101471 | 1091258
163 | 794788 | 796517 | 1078445 | 1071568
191 | 765934 | 776185 | 1059497 | 1041944
211 | 738656 | 777365 | 1083356 | 1046422
239 | 713124 | 841337 | 1104629 | 1116668
271 | 692138 | 847803 | 1094432 | 1128971
307 | 682919 | 849239 | 1086306 | 1127051
353 | 679449 | 842125 | 1071482 | 1117471
397 | 676217 | 844015 | 1058937 | 1118628

2 sockets:
conns | master | patch_v4 | master 1G | patch_v4 1G
--------+------------+------------+------------+------------
1 | 44317 | 44034 | 53920 | 53583
2 | 81193 | 78621 | 99138 | 97968
3 | 120755 | 115648 | 148102 | 147423
5 | 190007 | 188943 | 232078 | 231029
7 | 258602 | 260649 | 325545 | 318567
17 | 551814 | 552914 | 692312 | 697518
27 | 787353 | 786573 | 1023509 | 1022891
53 | 973880 | 1008534 | 1228274 | 1278194
83 | 1108442 | 1269777 | 1596292 | 1648156
107 | 1072188 | 1339634 | 1542401 | 1664476
139 | 1000446 | 1316372 | 1490757 | 1676127
163 | 967378 | 1257445 | 1461468 | 1655574
191 | 926010 | 1189591 | 1435317 | 1639313
211 | 909919 | 1149905 | 1417437 | 1632764
239 | 895944 | 1115681 | 1393530 | 1616329
271 | 880545 | 1090208 | 1374878 | 1609544
307 | 865560 | 1066798 | 1355164 | 1593769
353 | 857591 | 1046426 | 1330069 | 1584006
397 | 840374 | 1024711 | 1312257 | 1564872

--------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v4-0001-bufmgr-do-not-acquire-two-partition-lo.patch (text/x-patch; charset=UTF-8)
From c1b8e6d60030d5d02287ae731ab604feeafa7486 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v4 1/2] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 189 +++++++++++++---------------
 1 file changed, 89 insertions(+), 100 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..5d2781f4813 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1288,93 +1288,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1382,40 +1305,113 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We are single pinner, we hold buffer header lock and exclusive
+	 * partition lock (if tag is valid). Given these statements it is safe to
+	 * clear tag since no other process can inspect it to the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1423,13 +1419,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1

v4-1socket.gif (image/gif) [binary image data omitted]
v4-2socket.gif (image/gif) [binary image data omitted]
}��$�4�����<Hh5,�����B�)0N��CS ��������2`��QSz���M��C�s��B��X9e��p���e�P���
�W�KI0��
��1����S2�)H�D�L��d�8��6��-�Q�x�.��3<�"�,�?-�N��C}	�D�(M�3]�F�2�D�h�1�L�������C�S��%��UI��P7��O���Im���C:�-�=���^]�5}R� O������h�eBv#�"�V+�����)[P�[�����eC��W��C'�N��-��	N2HVv�6C���x��	<��,��	��\��4Ra���U6 �����`'������C�u2�\B2(��S����QS�6M���J$Yl�J��.����V-����� P����mW��p�
�&�AC��	�*��4���H��*��	%���mXJY�a����Zs���'(<o-�O���U�LM2���pvm�����41��.���-�Z���}J���.����I`�w
�%r��	Rp�� ����������;�V�d��]��0W����&Ip\����[��	XxTI�O�%�"X�����������]B����P�Y�����������ZI�Np�u�a��������a�&��k��G]��=������J�We�.I����	�����7�	���l�5�Z��V��G
N��������L`Zq����Q�C�������`;$N�������V��Z�_;�\���%��&�G���0�	���������n���@�}�����,�a	����<xv$�[�X�1PU;��J���M������(F���	$�P]�c�Y�]	�@�e��]�eB:�]0�	X��Vc��9���;- �3�Df����0��V �$�D�����B^������5�BV���Z�c��a��������ZOV���	�����P�U	Xp���D�
�JcNh��U�������J�cPu��=��N�O����xb��>bL�c2PU%�&v	-n����j$�4m� ��o�f����*�_KN�pb�cl�	���-`�]U�U��l��b&��eB|f���'q,�h�/����c'v��y-�!�	;�������e��	�^�~W�x������I��y
�$�\Xh�����W�d�S�M����
X�JI��
�U	@�(*���&�F��������Fk��7X��L���	PV�WP���������NF����0�nkA����1 2��2���v��p�^�K���~��~9R���[8������#���/���� ���P��!h	�����k��������m�%h	@��vlZ���V��6���)I	M ��.�1�u�A�n�f�GH���n�F
*6	<���.l�H�����Fo�6���YKx���H�C��T�����~YL���`�O�$���F�F��v�o�]�~p�n��������^s5�����5�F����
����k' ����+`q��
����q�6�����/d �;�n Ok�6	7�r�F8�_r�fc�(�rF)��
��,��E���s��>s��X�P���+g�8���?3�s�[@:�s4�����V?+p�@)@����� 4�3���G��Ds8�����B�2��N����G��4d�G4u�	��]]PM��3�Z��xZ��X@�XuETCGD��I��P-
��5Q

H��C�E[�EY��Y��h�vxt�m��n�G�� �zT���sG�tW�ug�vw�s�	@�s�L��X��ttG;G[O��<O�2��|�"�j���RHv=y��� �h���I[�I��x�����>�Q�����������L�&X:������V�'����
�����K�T���u���B�B����0�?�4m�8���� �R?8o���8pu�]�������0��?_E���O�` @�X��w{�u���������{J���\���Hy�g]p�o{��V�
�{���!����{�o[U�^����o�!xy�<{�oY����Xz���$X��g}��  ���}�U��������������}�}��X���W ��p��G~P}� H�@����UH��X��~P����8��?TZ�z��$&�-���_�x������}j
���[,h� ��
2l��!��'R�h�"��7r���#��y)i2H[^��l��%��2g��i�&����4Y���Z:�-j�(��J�2-��g(+�4�j�*��Z�r��j���G`t-k�,��j�:uBG#P�����.��z���H	���l��Y�F���3nl�V��F��j+�Bs���3h�o$S&8���'+��n��5��rH����m=�w����,�5�.>�7������$�����0�n�:v��@�*
�d��q��n�<��EJ����*!A��VXk��/�����HXA1�|�%4

+��_)�9� �Z)2WT���b���!�!�8�s�)��`IA�Pv�8#�5�T�\���G~���#�AvDK�&�X�)	�($�M:��*Qd�
H�Oj�����Dw��D����p�&�@�B����=�p[tz�c�}�	�#$�a�@X�����*��"<rP��-:)���2O�$�~�U�)�����":R� ~v��*�y�B���QR#���m����t����Y�-M����0�(���J;�U�P�������w�I-��V���2	t".��.G�m�7��.����JT�dK-�T���
{l�/�	�D#C<��P���24�0�3D�	���SN�?���m�j�2�1��h'�SE,��@����>�p�-�0#P�!(I������J:4�
SH�|�,g+�dQ���V5��Z;���	`G��
��k��	R6���Aa$fR����)A�R�mO��8��4�5�b~�v��6B������{^i(m`m�"���T(?A�*��j�}P*��T">��	�����

�><�g�������u����4�.Ky=��5����Tmp"��rI��� �fy������Iap��I�.H��Q�
���&�~�����k�)!S���[�H�=R�S=�
U�
�� @B����� 
�����0CAj���g+�N
s�$
,@ x���'#QH?���6�I\"���8 !�"���-r��^�"�(�1���f<#���5���n|#�(�9�Q��Ya'haF��M�`B��A���<$"��E2����$�������� ���"��� �@E `J���MiJWBd�d, ��YF�����->J�D`e0A���Xr  m��R��!���@���b
��@�p�MfS����4@o
��
�$.��Nw*��Lf(�u������*�Iy6D�yf4��K���0�Ps�s���e53#f@'b� V��E,
�J��#�E&p�����hH@R��"-�Ik�
���!g�'�I� )-j=g*��!�#O%����p�J=P�Nt���TmA�A�Lu*Q�JS��0�-����B�yM�^�����+Y������6
�B��R��Xu,L��b����@$W�6���-�Y���|`4�HIX�0�<`�N�R#f@@EZ�Z��6"�Ebr�B����Zs�3 ��tm!]����N����V�|��H�e"�����xm�	4@��U�{@R����9�{�^�"�<�.:�_2@!`�m!^�!g(Ct=��&J��	�n-r����P5���7����p/r�O$��E�0��"8�EVX�Sd�9�e�c��6v�C��X7~�}H����@�<^
��8�"d[H`
0��$�����fF�Tc��3S��]3C���+���8����]/���s)�b[t�!s^r��liB�����c��
 �A�L=�3D����E�{Q�7��!��?p������PP�j�����4����H'W�
q�-��`�U� ���<v�_�uZ[ �^��s#$��Lu0�]nt�z �6��"lfSx�6�k���&�!��6�}��["H@'F��6k����]��f|"n��@:BY��vC�D�\ �3M'�b�
D��F��i�'�/��O�e�vV�yv��A~t|3$�8w�tnt�8=�g�AZ���'����yB�Nq+�\�a�y'�v�@{"���E$p3#@��b��~N�H��k��	��� ���]����Z��B:R����{��
=�G���0w��=Qw4��@����B�B������iJz����
����t+2�;�=�~z����n8A�^������W��#�w�/>���"@z�����i�DP�)�;����Ch��3����?���R3������q`������6��A 	�)�k`:�����3��_�=`Q��A4�����i�@���y�zB��XA� ��:��B�_tGR���6	��� >��D
�4� 
J�2E$��!&�.!6�>!F�N!V�^!f�n!v�~!���!���!���!���!����L;
res.zip (application/zip)
[binary attachment: archive with res/p75/, res/csv/ and res/tbl/ result data plus res/gif/ graphs for the v4 1-, 2- and 3-socket runs]
v4-0002-PGPRO-5616-Add-HASH_REUSE-HASH_ASSIGN-and-use-it-.patch (text/x-patch; charset=UTF-8)
From 22d9613accc70eb2f9e799b87e976d64540f36b4 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH v4 2/2] Add HASH_REUSE+HASH_ASSIGN and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for a
different tag.

HASH_REUSE acts as HASH_REMOVE, but stores the element to reuse in a static
variable instead of a freelist partition. HASH_ASSIGN then uses that
element.

Unfortunately, FreeListData->nentries would still have to be updated even in
this case. So instead of manipulating nentries, we replace it with nfree
(the actual length of the free list) and nalloced (the number of entries
initially allocated for the free list). This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   9 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 130 ++++++++++++++++++++-----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   4 +-
 5 files changed, 120 insertions(+), 29 deletions(-)
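
Not part of the patch: a minimal sketch of the intended call sequence in
BufferAlloc, with pinning and collision handling elided (variable names as
in bufmgr.c). The old entry is detached under the old partition lock via
HASH_REUSE and relinked under the new partition lock via HASH_ASSIGN, so
neither a freelist spinlock nor a counter update is needed in between
(summed over freelists, live entries == sum(nalloced) - sum(nfree), and
neither sum changes here):

	LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
	/* HASH_REUSE: unlink the entry and keep it aside in DynaHashReuse */
	BufTableDelete(&oldTag, oldHash, true);
	LWLockRelease(oldPartitionLock);

	LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
	/* HASH_ASSIGN: relink the kept-aside entry under the new tag */
	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
	LWLockRelease(newPartitionLock);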

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..3362c7127e9 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -128,7 +128,7 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_ENTER,
+									HASH_ASSIGN,
 									&found);
 
 	if (found)					/* found something already in the table */
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If the reuse flag is true, the deleted entry is cached for reuse, and the
+ * caller must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5d2781f4813..85b62463c0d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1334,7 +1334,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1528,7 +1528,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c79..9eb07593da7 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -138,8 +138,9 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nalloced tracks the number of hashtable entries initially allocated for
+ * the freelist.
+ * nfree tracks the current number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,13 +148,15 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for
+								 * the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -170,7 +173,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -254,6 +257,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct holding the element kept aside by HASH_REUSE for HASH_ASSIGN.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +305,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -932,6 +950,10 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores the removed element in a
+ *					static variable instead of the free list.
+ *		HASH_ASSIGN: same as HASH_ENTER, but reuses the element stored by
+ *					HASH_REUSE, if any.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -1000,7 +1022,8 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		if (hctl->freeList[0].nfree == 0 &&
+			hctl->freeList[0].nalloced > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1067,10 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* If HASH_REUSE was not called, HASH_ASSIGN falls back to HASH_ENTER */
+	if (action == HASH_ASSIGN && DynaHashReuse.element == NULL)
+		action = HASH_ENTER;
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1084,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1116,6 +1140,7 @@ hash_search_with_hash_value(HTAB *hashp,
 							 errmsg("out of memory")));
 			}
 
+	link_element:
 			/* link into hashbucket chain */
 			*prevBucketPtr = currBucket;
 			currBucket->link = NULL;
@@ -1132,6 +1157,63 @@ hash_search_with_hash_value(HTAB *hashp,
 			 */
 
 			return (void *) ELEMENTKEY(currBucket);
+
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+				Assert(DynaHashReuseHTAB == NULL);
+				Assert(DynaHashReuseElement == NULL);
+
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ASSIGN */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should call HASH_ASSIGN as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
+		case HASH_ASSIGN:
+			/* check that HASH_REUSE was called for the same hash table */
+			Assert(DynaHashReuse.hashp == hashp);
+
+			/*
+			 * If an existing element is found, we need to return the stashed
+			 * element to a freelist. It doesn't matter much which one, since
+			 * we migrate elements between freelists anyway.
+			 */
+			if (currBucket != NULL)
+			{
+
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+
+				return (void *) ELEMENTKEY(currBucket);
+			}
+
+			currBucket = DynaHashReuse.element;
+
+			DynaHashReuse.element = NULL;
+			DynaHashReuse.hashp = NULL;
+
+			/* reuse HASH_ENTER code */
+			goto link_element;
 	}
 
 	elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1301,7 +1383,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1347,13 +1429,9 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 				if (newElement != NULL)
 				{
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1443,9 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1460,10 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = 0;
+
+	sum += hashp->hctl->freeList[0].nalloced;
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1473,10 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+		{
+			sum += hashp->hctl->freeList[i].nalloced;
+			sum -= hashp->hctl->freeList[i].nfree;
+		}
 	}
 
 	return sum;
@@ -1739,6 +1823,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl->freeList[freelist_idx].nalloced += nelem;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7c6653311a5..d35ee1b4108 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index d7af0239c8c..ba21c5a65a1 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,9 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE,
+	HASH_ASSIGN
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1

#22Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#21)
1 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:

Ok, here is v4.

And here is v5.

First, there was a compilation error in an Assert in dynahash.c.
Excuse me for not checking before sending the previous version.

Second, I added a third commit that reduces the HASHHDR allocation
size for non-partitioned dynahash:
- moved freeList to the last position
- alloc and memset offsetof(HASHHDR, freeList[1]) bytes for
non-partitioned hash tables.
I didn't benchmark it, but I will be surprised if it
matters much in the performance sense.

Third, I put all three commits into a single file so as not to
confuse the commitfest application.
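
For reference, the allocation trick in the third commit is just the usual
C idiom of cutting a struct short at its trailing array member. A minimal
standalone sketch with simplified stand-in types (not the actual dynahash
code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { char mutex; long nfree, nalloced; void *freeList; } FreeListData;

    typedef struct
    {
        long         dsize;           /* ...other header fields... */
        FreeListData freeList[32];    /* must now be the last member */
    } Header;

    static Header *
    alloc_header(bool partitioned)
    {
        /* a non-partitioned table only ever touches freeList[0], so
         * allocating up to the end of freeList[0] is enough */
        size_t  sz = partitioned ? sizeof(Header)
                                 : offsetof(Header, freeList[1]);
        Header *h = malloc(sz);

        if (h != NULL)
            memset(h, 0, sz);
        return h;
    }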

--------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v5-bufmgr-lock-improvements.patch (text/x-patch; charset=UTF-8)
From c1b8e6d60030d5d02287ae731ab604feeafa7486 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 189 +++++++++++++---------------
 1 file changed, 89 insertions(+), 100 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..5d2781f4813 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1288,93 +1288,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1382,40 +1305,113 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We are the single pinner, and we hold the buffer header lock and the
+	 * exclusive partition lock (if the tag is valid). Given that, it is safe
+	 * to clear the tag, since no other process can inspect it at the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it was valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it on
+		 * the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1423,13 +1419,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 9879b18d6fc0b6beccc71debac62470e26025f84 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 2/3] Add HASH_REUSE+HASH_ASSIGN and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for
a different tag.

HASH_REUSE acts as HASH_REMOVE, but stores the element to reuse in a static
variable instead of a freelist partition. HASH_ASSIGN then uses that
element.

Unfortunately, FreeListData->nentries would have to be manipulated even in
this case. So instead we replace nentries with nfree (the actual length of
the free list) and nalloced (the number of entries initially allocated for
the free list). This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   9 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 130 ++++++++++++++++++++-----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   4 +-
 5 files changed, 120 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..3362c7127e9 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -128,7 +128,7 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_ENTER,
+									HASH_ASSIGN,
 									&found);
 
 	if (found)					/* found something already in the table */
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If the reuse flag is true, the deleted entry is cached for reuse, and
+ * the caller must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5d2781f4813..85b62463c0d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1334,7 +1334,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it was valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1528,7 +1528,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c79..af4196cf194 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -138,8 +138,9 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nalloced tracks the number of hashtable entries initially allocated
+ * for the freelist.
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,13 +148,15 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for
+								 * the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -170,7 +173,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -254,6 +257,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +305,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -932,6 +950,10 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores the removed element in a
+ *					static variable instead of the free list.
+ *		HASH_ASSIGN: same as HASH_ENTER, but reuses the element stored by
+ *					HASH_REUSE, if any.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -1000,7 +1022,8 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		if (hctl->freeList[0].nalloced > (long) hctl->max_bucket &&
+			hctl->freeList[0].nfree == 0 &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1067,10 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* If HASH_REUSE was not called, HASH_ASSIGN falls back to HASH_ENTER */
+	if (action == HASH_ASSIGN && DynaHashReuse.element == NULL)
+		action = HASH_ENTER;
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1084,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1116,6 +1140,7 @@ hash_search_with_hash_value(HTAB *hashp,
 							 errmsg("out of memory")));
 			}
 
+	link_element:
 			/* link into hashbucket chain */
 			*prevBucketPtr = currBucket;
 			currBucket->link = NULL;
@@ -1132,6 +1157,63 @@ hash_search_with_hash_value(HTAB *hashp,
 			 */
 
 			return (void *) ELEMENTKEY(currBucket);
+
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+				Assert(DynaHashReuse.hashp == NULL);
+				Assert(DynaHashReuse.element == NULL);
+
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ASSIGN */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should call HASH_ASSIGN as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
+		case HASH_ASSIGN:
+			/* check that HASH_REUSE was called for the same hash table */
+			Assert(DynaHashReuse.hashp == hashp);
+
+			/*
+			 * If an existing element is found, we need to return the stashed
+			 * element to a freelist. It doesn't matter much which one, since
+			 * we migrate elements between freelists anyway.
+			 */
+			if (currBucket != NULL)
+			{
+
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+
+				return (void *) ELEMENTKEY(currBucket);
+			}
+
+			currBucket = DynaHashReuse.element;
+
+			DynaHashReuse.element = NULL;
+			DynaHashReuse.hashp = NULL;
+
+			/* reuse HASH_ENTER code */
+			goto link_element;
 	}
 
 	elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1301,7 +1383,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1347,13 +1429,9 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 				if (newElement != NULL)
 				{
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1443,9 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1460,10 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = 0;
+
+	sum += hashp->hctl->freeList[0].nalloced;
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1473,10 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+		{
+			sum += hashp->hctl->freeList[i].nalloced;
+			sum -= hashp->hctl->freeList[i].nfree;
+		}
 	}
 
 	return sum;
@@ -1739,6 +1823,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl->freeList[freelist_idx].nalloced += nelem;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7c6653311a5..d35ee1b4108 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index d7af0239c8c..ba21c5a65a1 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,9 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE,
+	HASH_ASSIGN
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From b57d77e4d4a7e8b4277d0f88cf09430e1e3163e6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Thu, 3 Mar 2022 01:14:58 +0300
Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash

A non-partitioned hash table doesn't use the 32 partitions of HASHHDR->freeList.
Let's allocate just a single free list.
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index af4196cf194..cd015abaecf 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -170,18 +170,6 @@ typedef struct
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -208,6 +196,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -281,7 +281,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -524,7 +524,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -533,7 +534,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -641,11 +642,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1

#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#22)
Re: BufferAlloc: don't take two simultaneous locks

At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:

Ok, here is v4.

And here is v5.

First, there was a compilation error in an Assert in dynahash.c.
Excuse me for not checking before sending the previous version.

Second, I added a third commit that reduces the HASHHDR allocation
size for non-partitioned dynahash:
- moved freeList to the last position
- alloc and memset offsetof(HASHHDR, freeList[1]) bytes for
non-partitioned hash tables.
I didn't benchmark it, but I will be surprised if it
matters much in the performance sense.

Third, I put all three commits into a single file so as not to
confuse the commitfest application.

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for

Why do we need nfree? HASH_ASSIGN should do the same thing as
HASH_REMOVE. Maybe the reason is that the code tries to put the detached
bucket on a different free list, but we could just remember the
freelist_idx for the detached bucket, as we do for hashp. I think that
should largely reduce the footprint of this patch.
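
A rough, untested sketch of what I mean, just extending the stash:

    struct HASHREUSE
    {
        HTAB       *hashp;
        HASHBUCKET  element;
        int         freelist_idx;   /* freelist the element came from */
    };

    /* in the HASH_REUSE case: */
    DynaHashReuse.element = currBucket;
    DynaHashReuse.hashp = hashp;
    DynaHashReuse.freelist_idx = freelist_idx;

HASH_ASSIGN could then fix nentries using the remembered index, keeping
the nentries semantics as-is.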

-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);

That optimization may help a bit, but isn't it unrelated to
this patch?

+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+				Assert(DynaHashReuse.hashp == NULL);
+				Assert(DynaHashReuse.element == NULL);

I think all cases in the switch(action) other than HASH_ASSIGN need
this assertion, and there is no need to check both; maybe checking only
element would be enough.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#23)
Re: BufferAlloc: don't take two simultaneous locks

At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for

Why do we need nfree? HASH_ASSIGN should do the same thing as
HASH_REMOVE. Maybe the reason is that the code tries to put the detached
bucket on a different free list, but we could just remember the
freelist_idx for the detached bucket, as we do for hashp. I think that
should largely reduce the footprint of this patch.

-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);

That optimization may help a bit, but isn't it unrelated to
this patch?

+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+				Assert(DynaHashReuse.hashp == NULL);
+				Assert(DynaHashReuse.element == NULL);

I think all cases in the switch(action) other than HASH_ASSIGN need
this assertion, and there is no need to check both; maybe checking only
element would be enough.

While I looked at the buf_table part, I came up with additional comments.

BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
hash_search_with_hash_value(SharedBufHash,
HASH_ASSIGN,
...
BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

BufTableDelete considers both the reuse and !reuse cases, but
BufTableInsert doesn't and always does HASH_ASSIGN. That looks
odd. We should use HASH_ENTER here. Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and
is needed, or returns it to the freelist if it exists but is not needed.
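
In pseudo-code, with illustrative helper names, something like:

    case HASH_ENTER:
        if (currBucket != NULL)
        {
            /* key already present: a stashed entry, if any, is not
             * needed; put it back on a freelist */
            if (DynaHashReuse.element != NULL)
                return_stashed_entry_to_freelist(hashp);
            return (void *) ELEMENTKEY(currBucket);
        }
        /* key absent: prefer the stashed entry over the freelist */
        if (DynaHashReuse.element != NULL)
            currBucket = take_stashed_entry(hashp);
        else
            currBucket = get_hash_entry(hashp, freelist_idx);
        /* ... link into the bucket chain as usual ... */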

What do you think about this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#25Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#24)
Re: BufferAlloc: don't take two simultaneous locks

At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Thanks! I looked into the dynahash part.

Then I looked into the bufmgr part. It looks fine to me, but I have some
comments on the code comments.

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.
if (oldFlags & BM_TAG_VALID)

We don't take a lock on the new mapping partition here.

+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
+	 *
+	 * We are the single pinner, and we hold the buffer header lock and the
+	 * exclusive partition lock (if the tag is valid). Given that, it is safe
+	 * to clear the tag, since no other process can inspect it at the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc. But I think what we need to explain here is why we
invalidate the buffer here even though we are going to reuse it soon.
And I think we need to state that the old buffer is now safe to use
for the new tag. I'm not sure the statement is really correct,
but clearing it out actually looks safer.

Now it is safe to use the victim buffer for the new tag. Invalidate the
buffer before releasing the header lock to ensure that linear scans of
the buffer array don't think the buffer is valid. It is safe
because it is guaranteed that we're the single pinner of the buffer.
That pin also prevents the buffer from being stolen by others until
we reuse it or return it to the freelist.

So I want to revise the following comment.

-	 * Now it is safe to use victim buffer for new tag.
+	 * Now reuse victim buffer for new tag.

* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*
* The usage_count starts out at 1 so that the buffer can survive one
* clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#26Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#23)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:

At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:

Ok, here is v4.

And here is v5.

First, there was a compilation error in an Assert in dynahash.c.
Excuse me for not checking before sending the previous version.

Second, I added a third commit that reduces the HASHHDR allocation
size for non-partitioned dynahash:
- moved freeList to the last position
- alloc and memset offsetof(HASHHDR, freeList[1]) bytes for
non-partitioned hash tables.
I didn't benchmark it, but I will be surprised if it
matters much in the performance sense.

Third, I put all three commits into a single file so as not to
confuse the commitfest application.

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

-       long            nentries;               /* number of entries in associated buckets */
+       long            nfree;                  /* number of free entries in the list */
+       long            nalloced;               /* number of entries initially allocated for

Why do we need nfree? HASH_ASSIGN should do the same thing as
HASH_REMOVE. Maybe the reason is that the code tries to put the detached
bucket on a different free list, but we could just remember the
freelist_idx for the detached bucket, as we do for hashp. I think that
should largely reduce the footprint of this patch.

If we keep nentries, then we need to fix nentries in both the old
freeList partition and the new one. That is two freeList[partition]->mutex
lock+unlock pairs.

But the count of free elements doesn't change, so if we change nentries
to nfree, there is no need to fix the freeList[partition]->nfree counters,
and no need to lock+unlock.
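
To illustrate, this is roughly what the bookkeeping would look like if we
kept nentries (simplified; old_idx/new_idx stand for the two freelist
indexes, and the mutexes are only taken for partitioned tables):

    /* nentries counts live entries, so it moves between partitions
     * and both freelists must be locked in turn */
    SpinLockAcquire(&hctl->freeList[old_idx].mutex);
    hctl->freeList[old_idx].nentries--;
    SpinLockRelease(&hctl->freeList[old_idx].mutex);

    SpinLockAcquire(&hctl->freeList[new_idx].mutex);
    hctl->freeList[new_idx].nentries++;
    SpinLockRelease(&hctl->freeList[new_idx].mutex);

With nfree, reusing the stashed element changes neither list's count of
free entries, so this block disappears entirely.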

-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);

That optimization may help a bit, but isn't it unrelated to
this patch?

+               case HASH_REUSE:
+                       if (currBucket != NULL)
+                       {
+                               /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+                               Assert(DynaHashReuse.hashp == NULL);
+                               Assert(DynaHashReuse.element == NULL);

I think all cases in the switch(action) other than HASH_ASSIGN need
this assertion, and there is no need to check both; maybe checking only
element would be enough.

Agreed.

#27Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#24)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:

At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

-     long            nentries;               /* number of entries in associated buckets */
+     long            nfree;                  /* number of free entries in the list */
+     long            nalloced;               /* number of entries initially allocated for

Why do we need nfree? HASH_ASSIGN should do the same thing as
HASH_REMOVE. Maybe the reason is that the code tries to put the detached
bucket on a different free list, but we could just remember the
freelist_idx for the detached bucket, as we do for hashp. I think that
should largely reduce the footprint of this patch.

-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);

That optimization may help a bit, but isn't it unrelated to
this patch?

(I forgot to answer this in the previous letter.)
Yes, the third commit is very optional. But adding `nalloced` to
`FreeListData` increases the allocation a lot even for ordinary
non-shared, non-partitioned dynahashes. And this allocation is
quite large right now for no meaningful reason.
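
Back-of-the-envelope, as a standalone sketch (slock_t is a plain char on
most platforms, so this should match LP64 builds):

    #include <stdio.h>
    #include <stddef.h>

    typedef struct
    {
        char        mutex;          /* stand-in for slock_t */
        long        nfree;
        long        nalloced;
        void       *freeList;       /* stand-in for HASHELEMENT * */
    } FreeListData;

    #define NUM_FREELISTS 32

    int
    main(void)
    {
        /* prints 32 and 1024 on LP64: about 1KB of freelists per
         * hash table, of which a non-partitioned table uses only
         * the first 32 bytes */
        printf("one freelist: %zu bytes\n", sizeof(FreeListData));
        printf("all freelists: %zu bytes\n",
               sizeof(FreeListData) * NUM_FREELISTS);
        return 0;
    }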

+             case HASH_REUSE:
+                     if (currBucket != NULL)
+                     {
+                             /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+                             Assert(DynaHashReuse.hashp == NULL);
+                             Assert(DynaHashReuse.element == NULL);

I think all cases in the switch(action) other than HASH_ASSIGN need
this assertion, and there is no need to check both; maybe checking only
element would be enough.

While I looked at the buf_table part, I came up with additional comments.

BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
hash_search_with_hash_value(SharedBufHash,
HASH_ASSIGN,
...
BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

BufTableDelete considers both the reuse and !reuse cases, but
BufTableInsert doesn't and always does HASH_ASSIGN. That looks
odd. We should use HASH_ENTER here. Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and
is needed, or returns it to the freelist if it exists but is not needed.

What do you think about this?

Well... I don't like it, but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
On the other hand, it is probably possible to merge them carefully.
I'll try.

---------

regards

Yura Sokolov

#28Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#25)
1 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 11/03/2022 at 17:21 +0900, Kyotaro Horiguchi wrote:

At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

This way it is possible to allocate just the first partition, not all 32 partitions.

Then I looked into the bufmgr part. It looks fine to me, but I have some
comments on the code comments.

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.
if (oldFlags & BM_TAG_VALID)

We don't take a lock on the new mapping partition here.

Thx, fixed.

+        * Clear out the buffer's tag and flags.  We must do this to ensure that
+        * linear scans of the buffer array don't think the buffer is valid. We
+        * also reset the usage_count since any recency of use of the old content
+        * is no longer relevant.
+        *
+        * We are the single pinner, and we hold the buffer header lock and the
+        * exclusive partition lock (if the tag is valid). Given that, it is safe
+        * to clear the tag, since no other process can inspect it at the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc. But I think what we need to explain here is why we
invalidate the buffer here even though we are going to reuse it soon.
And I think we need to state that the old buffer is now safe to use
for the new tag. I'm not sure the statement is really correct,
but clearing it out actually looks safer.

I've tried to reformulate the comment block.

Now it is safe to use the victim buffer for the new tag. Invalidate the
buffer before releasing the header lock to ensure that linear scans of
the buffer array don't think the buffer is valid. It is safe
because it is guaranteed that we're the single pinner of the buffer.
That pin also prevents the buffer from being stolen by others until
we reuse it or return it to the freelist.

So I want to revise the following comment.

-        * Now it is safe to use victim buffer for new tag.
+        * Now reuse victim buffer for new tag.

* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*
* The usage_count starts out at 1 so that the buffer can survive one
* clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

I used your suggestion.

On Fri, 11/03/22, Yura Sokolov wrote:

On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:

BufTableDelete considers both the reuse and !reuse cases, but
BufTableInsert doesn't and always does HASH_ASSIGN. That looks
odd. We should use HASH_ENTER here. Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and
is needed, or returns it to the freelist if it exists but is not needed.

What do you think about this?

Well... I don't like it, but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
On the other hand, it is probably possible to merge them carefully.
I'll try.

I've merged HASH_ASSIGN into HASH_ENTER.
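
The caller-side sequence in BufferAlloc stays as before (a simplified
sketch; the checks for equal old/new partitions are omitted):

    /* under the old partition's exclusive lock: stash the entry */
    if (oldFlags & BM_TAG_VALID)
        BufTableDelete(&oldTag, oldHash, true);

    LWLockRelease(oldPartitionLock);
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);

    /* HASH_ENTER now consumes the stashed entry, or returns it to a
     * freelist if someone else already inserted newTag */
    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);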

As in the previous letter, the three commits are concatenated into one file
and can be applied with `git am`.

-------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v6-bufmgr-lock-improvements.patch (text/x-patch; charset=UTF-8)
From fbec0dd7d9f11aeaeb8f141ad3dedab7178aeb2e Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..63824b15686 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * the tag first, so we need to have an exclusive lock on the old
+		 * mapping partition.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are the single pinner, and we hold the buffer header lock and the
+	 * exclusive partition lock (if the tag is valid). That means no other
+	 * process can inspect the buffer at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release the partition lock and the buffer header lock. We
+	 * must make sure no other backend uses this buffer until we reuse it for
+	 * the new tag. Therefore, we clear out the buffer's tag and flags and
+	 * remove it from the buffer table. The buffer also remains pinned to
+	 * ensure StrategyGetBuffer will not try to reuse it concurrently.
+	 *
+	 * We also reset the usage_count since any recency of use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 61bae0e93ba71a24890b9266e02e876d1306b5eb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 2/3] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses buffer for
different tag.

HASH_REUSE acts as HASH_REMOVE, but stores element to reuse in static
variable instead of freelist partition. And HASH_ENTER then may use the
element.

Unfortunately, FreeListData->nentries had to be manipulated even in this
case. So instead of manipulating nentries, we replace nentries with
nfree - the actual length of the free list - and nalloced - the number of
entries initially allocated for the free list. This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 126 ++++++++++++++++++++-----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 113 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If reuse flag is true, deleted entry is cached for reuse, and caller
+ * must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 63824b15686..204cebe3843 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1340,7 +1340,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1534,7 +1534,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..0cb35a0faf9 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -138,8 +138,9 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nalloced tracks the number of free hashtable entries initially allocated
+ * for the freelist.
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,13 +148,15 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for
+								 * the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -170,7 +173,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -254,6 +257,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +305,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -932,6 +950,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores removed element in static
+ *					variable instead of free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +963,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE were called then next dynahash operation must be HASH_ENTER
+ * on the same dynahash instance. Otherwise, assertion will be triggered.
+ * HASH_ENTER will reuse element stored with HASH_REUSE if no duplicate entry
+ * found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1025,9 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long nentries = hctl->freeList[0].nalloced - hctl->freeList[0].nfree;
+
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1071,11 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+	/* Check HASH_REUSE were called for same dynahash if were */
+	Assert(DynaHashReuse.element == NULL || DynaHashReuse.hashp == hashp);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1089,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1113,21 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ASSIGN */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should call HASH_ASSIGN as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,14 +1136,44 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (likely(DynaHashReuse.element == NULL))
+					return (void *) ELEMENTKEY(currBucket);
+
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
 				elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
 					 hashp->tabname);
 
-			currBucket = get_hash_entry(hashp, freelist_idx);
+			if (DynaHashReuse.element == NULL)
+			{
+				currBucket = get_hash_entry(hashp, freelist_idx);
+			}
+			else
+			{
+				currBucket = DynaHashReuse.element;
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+			}
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1301,7 +1375,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1346,14 +1420,11 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 				if (newElement != NULL)
 				{
+					Assert(hctl->freeList[borrow_from_idx].nfree > 0);
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1436,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
+	Assert(hctl->freeList[freelist_idx].nfree > 0);
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1454,10 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = 0;
+
+	sum += hashp->hctl->freeList[0].nalloced;
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1467,10 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+		{
+			sum += hashp->hctl->freeList[i].nalloced;
+			sum -= hashp->hctl->freeList[i].nfree;
+		}
 	}
 
 	return sum;
@@ -1739,6 +1817,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl->freeList[freelist_idx].nalloced += nelem;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..2ffcde678a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From f52411fc2043ad6aa301a42aa3693c968d5c0825 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Thu, 3 Mar 2022 01:14:58 +0300
Subject: [PATCH 3/3] reduce memory allocation for non-partitioned
 dynahash

A non-partitioned hash table doesn't use the 32 partitions of HASHHDR->freeList.
Let's allocate just a single free list.
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0cb35a0faf9..0a172005059 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -170,18 +170,6 @@ typedef struct
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -208,6 +196,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -281,7 +281,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -524,7 +524,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -533,7 +534,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -641,11 +642,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1

#29Zhihong Yu
zyu@yugabyte.com
In reply to: Yura Sokolov (#28)
Re: BufferAlloc: don't take two simultaneous locks

On Sun, Mar 13, 2022 at 3:25 AM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

On Fri, 11/03/2022 at 17:21 +0900, Kyotaro Horiguchi wrote:

At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

This way it is possible to allocate just the first partition, not all 32
partitions.

Then I looked into the bufmgr part. It looks fine to me, but I have some
comments on the code comments.

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

if (oldFlags & BM_TAG_VALID)

We don't take the lock on the new mapping partition here.

Thx, fixed.

+ * Clear out the buffer's tag and flags. We must do this to ensure that
+ * linear scans of the buffer array don't think the buffer is valid. We
+ * also reset the usage_count since any recency of use of the old content
+ * is no longer relevant.
+ *
+ * We are single pinner, we hold buffer header lock and exclusive
+ * partition lock (if tag is valid). Given these statements it is safe to
+ * clear tag since no other process can inspect it to the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc. But I think what we need to explain here is why we
invalidate the buffer here despite the fact that we are going to reuse
it soon. And I think we need to state that the old buffer is now safe
to use for the new tag. I'm not sure the statement is really correct,
but clearing it out actually looks safer.

I've tried to reformulate the comment block.

Now it is safe to use victim buffer for new tag. Invalidate the
buffer before releasing header lock to ensure that linear scans of
the buffer array don't think the buffer is valid. It is safe
because it is guaranteed that we're the single pinner of the buffer.
That pin also prevents the buffer from being stolen by others until
we reuse it or return it to freelist.

So I want to revise the following comment.

-        * Now it is safe to use victim buffer for new tag.
+        * Now reuse victim buffer for new tag.

* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint.  Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*
* The usage_count starts out at 1 so that the buffer can survive one
* clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

Used suggestion.
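
To make the resulting lock ordering concrete, here is a minimal
pseudocode sketch of the eviction path as patch 1/3 arranges it
(names as in bufmgr.c; the retry loop, collision handling and error
paths are elided):

    /* victim buffer is already pinned; we are its only pinner */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);  /* old partition only */

    buf_state = LockBufHdr(buf);
    CLEAR_BUFFERTAG(buf->tag);          /* linear scans now see it as invalid */
    buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
    UnlockBufHdr(buf, buf_state);

    BufTableDelete(&oldTag, oldHash);   /* no lookup can find the buffer now */
    LWLockRelease(oldPartitionLock);    /* old lock released ... */

    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE); /* ... before new lock taken */
    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    /* buf_id >= 0 means a collision: UnpinBuffer() + StrategyFreeBuffer()
     * and proceed with the buffer another backend already installed */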

On Fri, 11/03/22 Yura Sokolov wrote:

On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:

BufTableDelete considers both reuse and !reuse cases but
BufTableInsert doesn't and always does HASH_ASSIGN. That looks
odd. We should use HASH_ENTER here. Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and
is needed, or returns it to the freelist if it exists but is not needed.

What do you think about this?

Well... I don't like it but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs quite a bit.
On the other hand, it is probably possible to merge them carefully.
I'll try.

I've merged HASH_ASSIGN into HASH_ENTER.
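
To illustrate, the caller-side contract after the merge looks roughly
like this (a sketch; per the patch comments, HASH_REUSE must be
immediately followed by HASH_ENTER on the same hash table, otherwise
an assertion fires):

    bool        found;

    /* evicting: stash the old entry instead of pushing it to a freelist */
    hash_search_with_hash_value(SharedBufHash, (void *) &oldTag, oldHash,
                                HASH_REUSE, NULL);

    /* ... release the old partition lock, acquire the new one ... */

    /* inserting: HASH_ENTER consumes the stashed element, or returns it
     * to the freelist if the key already exists */
    (void) hash_search_with_hash_value(SharedBufHash, (void *) &newTag,
                                       newHash, HASH_ENTER, &found);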

As in the previous letter, the three commits are concatenated into one
file and can be applied with `git am`.

-------

regards

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Hi,
In the description:

There is no need to hold both lock simultaneously.

both lock -> both locks

+ * We also reset the usage_count since any recency of use of the old

recency of use -> recent use

+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

Later on, there is code:

+ reuse ? HASH_REUSE : HASH_REMOVE,

Can a flag (such as HASH_REUSE) be passed to BufTableDelete() instead of a
bool? That way, the flag can be used directly in the above place.

+ long nalloced; /* number of entries initially allocated for

nallocated isn't very long. I think it would be better to name the
field 'nallocated'.

+           sum += hashp->hctl->freeList[i].nalloced;
+           sum -= hashp->hctl->freeList[i].nfree;

I think it would be better to calculate the difference between nalloced and
nfree first, then add the result to sum (to avoid overflow).
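
A sketch of the suggested reordering (hypothetical, only to illustrate
the point):

    long        sum = 0;
    int         i;

    /* add per-freelist differences rather than the raw totals, so the
     * intermediate sums stay close to the true number of live entries */
    for (i = 0; i < NUM_FREELISTS; i++)
        sum += hashp->hctl->freeList[i].nalloced -
               hashp->hctl->freeList[i].nfree;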

Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash

memory allocation -> memory allocations

Cheers

#30Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhihong Yu (#29)
1 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Sun, 13/03/2022 at 07:05 -0700, Zhihong Yu wrote:

Hi,
In the description:

There is no need to hold both lock simultaneously.

both lock -> both locks

Thanks.

+ * We also reset the usage_count since any recency of use of the old

recency of use -> recent use

Thanks.

+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

Later on, there is code:

+ reuse ? HASH_REUSE : HASH_REMOVE,

Can a flag (such as HASH_REUSE) be passed to BufTableDelete() instead of a bool? That way, the flag can be used directly in the above place.

No.
The BufTable* functions exist to abstract the buffer table from dynahash.
Passing HASH_REUSE directly would break that abstraction.

+ long nalloced; /* number of entries initially allocated for

nallocated isn't very long. I think it would be better to name the field 'nallocated'.

It is debatable.
Why not num_allocated? allocated_count? number_of_allocations?
The same points apply to nfree.
`nalloced` is recognizable and unambiguous. And there are a lot
of `*alloced` names in the PostgreSQL source, so this one will not
be unusual.

I don't see the need to make it longer.

But if someone supports your point, I won't mind changing
the name.

+           sum += hashp->hctl->freeList[i].nalloced;
+           sum -= hashp->hctl->freeList[i].nfree;

I think it would be better to calculate the difference between nalloced and nfree first, then add the result to sum (to avoid overflow).

It doesn't really matter much, because the calculation must be valid
even if all nfree==0.

I'd rather debate the use of 'long' in dynahash at all: 'long' is
32 bits on 64-bit Windows. It would be better to use 'Size' here.

But 'nentries' was 'long', so I didn't change things. I think
that is material for another patch.

(On the other hand, a dynahash with 2**31 elements takes at least
512GB of RAM... we would hardly hit that problem before the OOM killer
comes. Does Windows have an OOM killer?)

Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash

memory allocation -> memory allocations

For each dynahash instance, a single allocation is reduced.
I think 'memory allocation' is correct.

The plural would be
reduce memory allocations for non-partitioned dynahashes
i.e. both 'allocations' and 'dynahashes'.
Am I wrong?
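
For reference, a minimal sketch of the sizing trick patch 3/3 relies on
(assuming freeList has been moved to the end of HASHHDR, as in the
patch; "partitioned" stands for (flags & HASH_PARTITION) != 0):

    /* with freeList last in HASHHDR, a non-partitioned table needs only
     * freeList[0], so the header allocation can be truncated after it */
    Size        size = partitioned ?
        sizeof(HASHHDR) :                   /* all NUM_FREELISTS lists */
        offsetof(HASHHDR, freeList[1]);     /* just freeList[0] */

    hashp->hctl = (HASHHDR *) hashp->alloc(size);
    MemSet(hashp->hctl, 0, size);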

------

regards
Yura Sokolov

Attachments:

v7-bufmgr-lock-improvements.patch (text/x-patch; charset=UTF-8)
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned,
so other processes cannot select it for eviction. If the tag is cleared
and the buffer is removed from the old partition, other processes will
not find it. Therefore it is safe to release the old partition lock
before acquiring the new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * tag first, so we need to have exclusive lock on the old mapping
+		 * partitions.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are single pinner, we hold buffer header lock and exclusive
+	 * partition lock (if tag is valid). It means no other process can inspect
+	 * it at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release partition lock and buffer header lock. We must be
+	 * sure other backend will not use this buffer until we reuse it for new
+	 * tag. Therefore, we clear out the buffer's tag and flags and remove it
+	 * from buffer table. Also buffer remains pinned to ensure
+	 * StrategyGetBuffer will not try to reuse the buffer concurrently.
+	 *
+	 * We also reset the usage_count since any recent use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 34b323bf5a26041cd2763a551a976ef043d152f8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 2/3] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses buffer for
different tag.

HASH_REUSE acts as HASH_REMOVE, but stores element to reuse in static
variable instead of freelist partition. And HASH_ENTER then may use the
element.

Unfortunately, FreeListData->nentries had to be manipulated even in this
case. So instead of manipulating nentries, we replace nentries with
nfree - the actual length of the free list - and nalloced - the number of
entries initially allocated for the free list. This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 126 ++++++++++++++++++++-----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 113 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If reuse flag is true, deleted entry is cached for reuse, and caller
+ * must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7dbfc90aaa..a16da37fe3d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1340,7 +1340,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1534,7 +1534,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..0cb35a0faf9 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -138,8 +138,9 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nalloced tracks the number of free hashtable entries initially allocated
+ * for the freelist.
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,13 +148,15 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for
+								 * the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -170,7 +173,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -254,6 +257,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +305,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -932,6 +950,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores removed element in static
+ *					variable instead of free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +963,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE were called then next dynahash operation must be HASH_ENTER
+ * on the same dynahash instance. Otherwise, assertion will be triggered.
+ * HASH_ENTER will reuse element stored with HASH_REUSE if no duplicate entry
+ * found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1025,9 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long nentries = hctl->freeList[0].nalloced - hctl->freeList[0].nfree;
+
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1071,11 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+	/* Check HASH_REUSE were called for same dynahash if were */
+	Assert(DynaHashReuse.element == NULL || DynaHashReuse.hashp == hashp);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1089,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1113,21 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ASSIGN */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should call HASH_ASSIGN as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,14 +1136,44 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (likely(DynaHashReuse.element == NULL))
+					return (void *) ELEMENTKEY(currBucket);
+
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
 				elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
 					 hashp->tabname);
 
-			currBucket = get_hash_entry(hashp, freelist_idx);
+			if (DynaHashReuse.element == NULL)
+			{
+				currBucket = get_hash_entry(hashp, freelist_idx);
+			}
+			else
+			{
+				currBucket = DynaHashReuse.element;
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+			}
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1301,7 +1375,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1346,14 +1420,11 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 				if (newElement != NULL)
 				{
+					Assert(hctl->freeList[borrow_from_idx].nfree > 0);
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1436,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
+	Assert(hctl->freeList[freelist_idx].nfree > 0);
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1454,10 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = 0;
+
+	sum += hashp->hctl->freeList[0].nalloced;
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1467,10 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+		{
+			sum += hashp->hctl->freeList[i].nalloced;
+			sum -= hashp->hctl->freeList[i].nfree;
+		}
 	}
 
 	return sum;
@@ -1739,6 +1817,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl->freeList[freelist_idx].nalloced += nelem;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..2ffcde678a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From 552ade23c3790734c302c3d597eb53d18b92dd5c Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Thu, 3 Mar 2022 01:14:58 +0300
Subject: [PATCH 3/3] reduce memory allocation for non-partitioned
 dynahash

A non-partitioned hash table doesn't use the 32 partitions of HASHHDR->freeList.
Let's allocate just a single free list.
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0cb35a0faf9..0a172005059 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -170,18 +170,6 @@ typedef struct
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -208,6 +196,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -281,7 +281,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -524,7 +524,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -533,7 +534,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -641,11 +642,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1

#31Zhihong Yu
zyu@yugabyte.com
In reply to: Yura Sokolov (#30)
Re: BufferAlloc: don't take two simultaneous locks

On Sun, Mar 13, 2022 at 3:27 PM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

On Sun, 13/03/2022 at 07:05 -0700, Zhihong Yu wrote:

Hi,
In the description:

There is no need to hold both lock simultaneously.

both lock -> both locks

Thanks.

+ * We also reset the usage_count since any recency of use of the old

recency of use -> recent use

Thanks.

+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

Later on, there is code:

+ reuse ? HASH_REUSE : HASH_REMOVE,

Can a flag (such as HASH_REUSE) be passed to BufTableDelete() instead of
a bool? That way, the flag can be used directly in the above place.

No.
The BufTable* functions were created to abstract the buffer table from
dynahash. Passing HASH_REUSE directly would break the abstraction.

+ long nalloced; /* number of entries initially allocated for

nallocated isn't very long. I think it would be better to name the
field 'nallocated'.

It is debatable.
Why not num_allocated? allocated_count? number_of_allocations?
The same points apply to nfree.
`nalloced` is recognizable and unambiguous, and there are a lot
of `*alloced` identifiers in the PostgreSQL source, so this one will
not be unusual.

I don't see the need to make it longer.

But if someone supports your point, I will not mind changing
the name.

+           sum += hashp->hctl->freeList[i].nalloced;
+           sum -= hashp->hctl->freeList[i].nfree;

I think it would be better to calculate the difference between nalloced

and nfree first, then add the result to sum (to avoid overflow).

It doesn't really matter much: the calculation must be valid even when
every nfree is 0, so the plain sum of the nalloced values must already
fit without overflow.

I'd rather debate the use of 'long' in dynahash at all: 'long' is
32-bit on 64-bit Windows. It would be better to use 'Size' here.

But 'nentries' was already 'long', so I didn't change things. I think
that is material for another patch.

(On the other hand, a dynahash with 2**31 elements means at least
512GB of RAM... we would hardly hit the problem before the OOM killer
comes. Does Windows have an OOM killer?)

Subject: [PATCH 3/3] reduce memory allocation for non-partitioned

dynahash

memory allocation -> memory allocations

For each dynahash instance, a single allocation is reduced, so I think
'memory allocation' is correct.

The plural would be
'reduce memory allocations for non-partitioned dynahashes',
i.e. both 'allocations' and 'dynahashes'.
Am I wrong?

Hi,

bq. reduce memory allocation for non-partitioned dynahash

It seems the following is clearer:

reduce one memory allocation for every non-partitioned dynahash

Cheers

#32Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#26)
Re: BufferAlloc: don't take two simultaneous locks

At Fri, 11 Mar 2022 11:30:27 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:

At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:

Ok, here is v4.

And here is v5.

First, there was a compilation error in an Assert in dynahash.c.
Excuse me for not checking before sending the previous version.

Second, I added a third commit that reduces the HASHHDR allocation
size for a non-partitioned dynahash:
- moved freeList to the last position
- allocate and memset only offsetof(HASHHDR, freeList[1]) for
non-partitioned hash tables.
I didn't benchmark it, but I will be surprised if it matters much
performance-wise.

Third, I put all three commits into a single file so as not to
confuse the commitfest application.

Thanks! I looked into the dynahash part.

struct HASHHDR
{
- /*
- * The freelist can become a point of contention in high-concurrency hash

Why did you move around the freeList?

-       long            nentries;               /* number of entries in associated buckets */
+       long            nfree;                  /* number of free entries in the list */
+       long            nalloced;               /* number of entries initially allocated for

Why do we need nfree? HASH_ASSIGN should do the same thing as
HASH_REMOVE. Maybe the reason is that the code tries to put the detached
bucket on a different free list, but we could just remember the
freelist_idx for the detached bucket as we do for hashp. I think that
would largely reduce the footprint of this patch.

If we keep nentries, then we need to fix nentries in both the old
freeList partition and the new one. That is two
freeList[partition]->mutex lock+unlock pairs.

But the count of free elements doesn't change, so if we change nentries
to nfree, there is no need to fix the freeList[partition]->nfree
counters, and no need to lock+unlock.

Ah, okay. I missed that bucket reuse changes the key in most cases.

But I still don't think it's good to move entries around partition
freelists, for another reason: I'm afraid the freelists could get into
an imbalanced state. get_hash_entry prefers fresh shmem allocation over
other freelists, so that could lead to freelist bloat, or worse
contention than the traditional way involving more than two partitions.

I'll examine the possibility of resolving this...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#33Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#27)
Re: BufferAlloc: don't take two simultaneous locks

At Fri, 11 Mar 2022 12:34:32 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:

At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

BufTableDelete considers both the reuse and !reuse cases, but
BufTableInsert doesn't and always does HASH_ASSIGN. That looks
odd. We should use HASH_ENTER here. Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and
is needed, or returns it to the freelist if it exists but is not needed.

What do you think about this?

Well... I don't like it but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
On the other hand, it is probably possible to merge them carefully.
I'll try.

Honestly, I'm not sure it wins on a performance basis. It just came from
interface consistency (mmm, a bit different, maybe... convincibility?).

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

#34Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#32)
Re: BufferAlloc: don't take two simultaneous locks

At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

I'll examine the possibility of resolving this...

The existence of nfree and nalloced confused me, and I found the
reason.

In the case where a partition collects many REUSE-ASSIGN-REMOVEd
elements from other partitions, nfree can get larger than nalloced. That
is the strange point of the two counters. nalloced is only ever referred
to as sum(nalloced[]), so we don't need nalloced on a per-partition
basis, and the formula to calculate the number of used elements would be
as follows.

sum(nalloced - nfree)
= <total_nalloced> - sum(nfree)

We rarely create fresh elements in shared hashes, so I don't think
there would be additional contention on <total_nalloced> even if it
were a global atomic.
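As a sketch, with a global atomic the used-entry count reduces to
exactly that formula; this is how the v8 patch later in this thread
implements hash_get_num_entries:

long
hash_get_num_entries(HTAB *hashp)
{
	int			i;
	/* used entries = total ever allocated minus all free-list entries */
	long		sum = nalloced_read(&hashp->hctl->nalloced);

	sum -= hashp->hctl->freeList[0].nfree;
	if (IS_PARTITIONED(hashp->hctl))
	{
		for (i = 1; i < NUM_FREELISTS; i++)
			sum -= hashp->hctl->freeList[i].nfree;
	}
	return sum;
}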

So the remaining issue is the possible imbalance among partitions. On
second thought: under the current scheme, if there is a bad deviation in
partition usage, a heavily hit partition eventually collects elements
via get_hash_entry(). Under the patch's scheme, a similar thing happens
via the REUSE-ASSIGN-REMOVE sequence. Buffers once used for something
won't be freed until buffer invalidation, and bulk buffer invalidation
won't distribute the freed buffers unevenly among partitions. So I
conclude for now that this is a non-issue.

So my opinion on the counters is:

I'd like to ask you to remove nalloced from the partitions and add a
global atomic for the same use.

There is no need to do anything about the possible deviation issue.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#35Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#34)
Re: BufferAlloc: don't take two simultaneous locks

On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:


So my opinion on the counters is: I'd like to ask you to remove
nalloced from the partitions and add a global atomic for the same use.

I really believe it should be global. I made it per-partition so as not
to overcomplicate the first versions. Glad you say so.

I thought of protecting it with freeList[0].mutex, but an atomic is
probably the better idea here. But which atomic to choose: uint64 or
uint32? Based on sizeof(long)?
OK, I'll do it in the next version.

The whole of get_hash_entry looks strange.
Wouldn't it be better to cycle through the other partitions' freelists
first, and only then allocate fresh elements?
Maybe there should be a bitmap of non-empty free lists? 32 bits for
32 partitions. But wouldn't the bitmap become a contention point itself?


-------

regards
Yura Sokolov

#36Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#35)
Re: BufferAlloc: don't take two simultaneous locks

At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:

I'd like to ask you to remove nalloced from the partitions and add a
global atomic for the same use.

I really believe it should be global. I made it per-partition so as not
to overcomplicate the first versions. Glad you say so.

I thought of protecting it with freeList[0].mutex, but an atomic is
probably the better idea here. But which atomic to choose: uint64 or
uint32? Based on sizeof(long)?
OK, I'll do it in the next version.

The current nentries is a long (= int64 on CentOS). A uint32 could
support roughly 2^32 * 8192 bytes = 32TB of shared buffers, which
doesn't seem safe enough. So it should be uint64.
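For reference, the v8 patch later in this thread settles this by
matching the atomic's width to sizeof(long), so the counter can stand in
for the former long nentries on every platform:

#if SIZEOF_LONG == 4
typedef pg_atomic_uint32 nalloced_t;
#define nalloced_read(a)	(long) pg_atomic_read_u32(a)
#define nalloced_add(a, v)	pg_atomic_fetch_add_u32((a), (uint32) (v))
#else
typedef pg_atomic_uint64 nalloced_t;
#define nalloced_read(a)	(long) pg_atomic_read_u64(a)
#define nalloced_add(a, v)	pg_atomic_fetch_add_u64((a), (uint64) (v))
#endif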

The whole of get_hash_entry looks strange.
Wouldn't it be better to cycle through the other partitions' freelists
first, and only then allocate fresh elements?
Maybe there should be a bitmap of non-empty free lists? 32 bits for
32 partitions. But wouldn't the bitmap become a contention point itself?

The code puts significance on avoiding the contention caused by
visiting the freelists of other partitions, and perhaps assumes that
freelist shortage rarely happens.

I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
128kB shared buffers, and I saw that get_hash_entry never takes the
!element_alloc() path: it always allocates a fresh entry, then
saturates at 30 new elements allocated by the middle of a 100-second
run.

Then I tried the same with the patch, and I was surprised to see that
the rise in the number of newly allocated elements didn't stop and
went up to 511 elements after the 100-second run. So I found that my
concern was valid: the change in dynahash actually causes a
continuous/repeated shortage of free-list entries. I'm not sure how
much impact it would have on performance if we changed get_hash_entry
to prefer other freelists, though.

By the way, there's the following comment in StrategyInitialize.

* Initialize the shared buffer lookup hashtable.
*
* Since we can't tolerate running out of lookup table entries, we must be
* sure to specify an adequate table size here. The maximum steady-state
* usage is of course NBuffers entries, but BufferAlloc() tries to insert
* a new entry before deleting the old. In principle this could be
* happening in each partition concurrently, so we could need as many as
* NBuffers + NUM_BUFFER_PARTITIONS entries.
*/
InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

"but BufferAlloc() tries to insert a new entry before deleting the
old." gets false by this patch but still need that additional room for
stashed entries. It seems like needing a fix.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#37Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#36)
Re: BufferAlloc: don't take two simultaneous locks

At Mon, 14 Mar 2022 17:12:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Then I tried the same with the patch, and I was surprised to see that
the rise in the number of newly allocated elements didn't stop and
went up to 511 elements after the 100-second run. So I found that my
concern was valid.

Which means my last decision was wrong, with high odds...

--
Kyotaro Horiguchi
NTT Open Source Software Center

#38Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#36)
Re: BufferAlloc: don't take two simultaneous locks

On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:


I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
128kB shared buffers, and I saw that get_hash_entry never takes the
!element_alloc() path: it always allocates a fresh entry, then
saturates at 30 new elements allocated by the middle of a 100-second
run.

Then I tried the same with the patch, and I was surprised to see that
the rise in the number of newly allocated elements didn't stop and
went up to 511 elements after the 100-second run. So I found that my
concern was valid: the change in dynahash actually causes a
continuous/repeated shortage of free-list entries. I'm not sure how
much impact it would have on performance if we changed get_hash_entry
to prefer other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.
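For concreteness, the change in question is a one-flag addition in
InitBufTable; patch 3 of the v8 series below does exactly this:

	/* With HASH_FIXED_SIZE, dynahash refuses to element_alloc() past
	 * the initial allocation, so freelists can only be refilled by
	 * returned entries, never grown. */
	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
								  size, size,
								  &info,
								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
								  HASH_FIXED_SIZE);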

Another way to reduce the observed case is to remember the freelist_idx
for the reused entry. I didn't believe it mattered much, since entries
migrate nonetheless, but probably due to some hot buffers there is a
tendency to crowd a particular freelist.


#39Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#38)
4 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:


Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.

Another way to reduce the observed case is to remember the freelist_idx
for the reused entry. I didn't believe it mattered much, since entries
migrate nonetheless, but probably due to some hot buffers there is a
tendency to crowd a particular freelist.

Well, I did both. Everything looks ok.

By the way, there's the following comment in StrategyInitialize.

* Initialize the shared buffer lookup hashtable.
*
* Since we can't tolerate running out of lookup table entries, we must be
* sure to specify an adequate table size here. The maximum steady-state
* usage is of course NBuffers entries, but BufferAlloc() tries to insert
* a new entry before deleting the old. In principle this could be
* happening in each partition concurrently, so we could need as many as
* NBuffers + NUM_BUFFER_PARTITIONS entries.
*/
InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

"but BufferAlloc() tries to insert a new entry before deleting the
old." gets false by this patch but still need that additional room for
stashed entries. It seems like needing a fix.

I removed the whole paragraph, because a fixed-size table without
extra entries works just fine.

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

128MB and 1GB shared buffers
pgbench with scale 100
select_only benchmark, unix sockets.

Notebook i7-1165G7:

conns | master | v8 | master 1G | v8 1G
--------+------------+------------+------------+------------
1 | 29614 | 29285 | 32413 | 32784
2 | 58541 | 60052 | 65851 | 65938
3 | 91126 | 90185 | 101404 | 101956
5 | 135809 | 133670 | 143783 | 143471
7 | 155547 | 153568 | 162566 | 162361
17 | 221794 | 218143 | 250562 | 250136
27 | 213742 | 211226 | 241806 | 242594
53 | 216067 | 214792 | 245868 | 246269
83 | 216610 | 218261 | 246798 | 250515
107 | 216169 | 216656 | 248424 | 250105
139 | 208892 | 215054 | 244630 | 246439
163 | 206988 | 212751 | 244061 | 248051
191 | 203842 | 214764 | 241793 | 245081
211 | 201304 | 213997 | 240863 | 246076
239 | 199313 | 211713 | 239639 | 243586
271 | 196712 | 211849 | 236231 | 243831
307 | 194879 | 209813 | 233811 | 241303
353 | 191279 | 210145 | 230896 | 241039
397 | 188509 | 207480 | 227812 | 240637

X5675 1 socket:

conns | master | v8 | master 1G | v8 1G
--------+------------+------------+------------+------------
1 | 18590 | 18473 | 19652 | 19051
2 | 34899 | 34799 | 37242 | 37432
3 | 51484 | 51393 | 54750 | 54398
5 | 71037 | 70564 | 76482 | 75985
7 | 87391 | 86937 | 96185 | 95433
17 | 122609 | 123087 | 140578 | 140325
27 | 120051 | 120508 | 136318 | 136343
53 | 116851 | 117601 | 133338 | 133265
83 | 113682 | 116755 | 131841 | 132736
107 | 111925 | 116003 | 130661 | 132386
139 | 109338 | 115011 | 128319 | 131453
163 | 107661 | 114398 | 126684 | 130677
191 | 105000 | 113745 | 124850 | 129909
211 | 103607 | 113347 | 123469 | 129302
239 | 101820 | 112428 | 121752 | 128621
271 | 100060 | 111863 | 119743 | 127624
307 | 98554 | 111270 | 117650 | 126877
353 | 97530 | 110231 | 115904 | 125351
397 | 96122 | 109471 | 113609 | 124150

X5675 2 socket:

conns | master | v8 | master 1G | v8 1G
--------+------------+------------+------------+------------
1 | 17815 | 17577 | 19321 | 19187
2 | 34312 | 35655 | 37121 | 36479
3 | 51868 | 52165 | 56048 | 54984
5 | 81704 | 82477 | 90945 | 90109
7 | 107937 | 105411 | 116015 | 115810
17 | 191339 | 190813 | 216899 | 215775
27 | 236541 | 238078 | 278507 | 278073
53 | 230323 | 231709 | 267226 | 267449
83 | 225560 | 227455 | 261996 | 262344
107 | 221317 | 224030 | 259694 | 259553
139 | 206945 | 219005 | 254817 | 256736
163 | 197723 | 220353 | 251631 | 257305
191 | 193243 | 219149 | 246960 | 256528
211 | 189603 | 218545 | 245362 | 255785
239 | 186382 | 217229 | 240006 | 255024
271 | 183141 | 216359 | 236927 | 253069
307 | 179275 | 215218 | 232571 | 252375
353 | 175559 | 213298 | 227244 | 250534
397 | 172916 | 211627 | 223513 | 248919

A strange thing: both master and the patched version have a higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

October master was 2d44dee0281a1abf and today's is 7e12256b478b895

(There is a small possibility that I tested with TCP sockets
in October and with UNIX sockets today, and that made the difference.)

[1]: https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

-------

regards
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru

Attachments:

v8-bufmgr-lock-improvements.patch (text/x-patch)
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is
pinned, so other processes cannot select it for eviction. If the tag is
cleared and the buffer is removed from the old partition, other
processes will not find it. Therefore it is safe to release the old
partition lock before acquiring the new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * tag first, so we need to have exclusive lock on the old mapping
+		 * partitions.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are single pinner, we hold buffer header lock and exclusive
+	 * partition lock (if tag is valid). It means no other process can inspect
+	 * it at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release partition lock and buffer header lock. We must be
+	 * sure other backend will not use this buffer until we reuse it for new
+	 * tag. Therefore, we clear out the buffer's tag and flags and remove it
+	 * from buffer table. Also buffer remains pinned to ensure
+	 * StrategyGetBuffer will not try to reuse the buffer concurrently.
+	 *
+	 * We also reset the usage_count since any recent use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 51da98121aa2404d1e3e3a42f5f40fddb9877e61 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 2/4] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses buffer for
different tag.

HASH_REUSE acts as HASH_REMOVE, but stores element to reuse in static
variable instead of freelist partition. And HASH_ENTER then may use the
element.

Unfortunately, FreeListData->nentries had to be manipulated even in this
case. So instead of manipulating nentries, we replace it with nfree (the
actual length of the free list) and nalloced (the number of entries
initially allocated for the free list). This was suggested by Robert
Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 142 +++++++++++++++++++++----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 129 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If reuse flag is true, deleted entry is cached for reuse, and caller
+ * must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7dbfc90aaa..a16da37fe3d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1340,7 +1340,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1534,7 +1534,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..4d44276e3e6 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -98,6 +98,7 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "port/atomics.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -138,8 +139,7 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,16 +147,30 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)	(long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v)	pg_atomic_fetch_add_u32((a), (uint32)(v))
+#define nalloced_init(a)	pg_atomic_init_u32((a), 0)
+#else
+typedef pg_atomic_uint64 nalloced_t;
+typedef uint64 nalloced_value_t;
+#define nalloced_read(a)	(long)pg_atomic_read_u64(a)
+#define nalloced_add(a, v)	pg_atomic_fetch_add_u64((a), (uint64)(v))
+#define nalloced_init(a)	pg_atomic_init_u64((a), 0)
+#endif
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -170,7 +184,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -195,6 +209,7 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	nalloced_t	nalloced;		/* number of entries allocated */
 
 #ifdef HASH_STATISTICS
 
@@ -254,6 +269,16 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+	int			freelist_idx;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +318,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL, 0};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -640,6 +671,8 @@ hdefault(HTAB *hashp)
 	hctl->ssize = DEF_SEGSIZE;
 	hctl->sshift = DEF_SEGSIZE_SHIFT;
 
+	nalloced_init(&hctl->nalloced);
+
 #ifdef HASH_STATISTICS
 	hctl->accesses = hctl->collisions = 0;
 #endif
@@ -932,6 +965,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores removed element in static
+ *					variable instead of free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +978,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE were called then next dynahash operation must be HASH_ENTER
+ * on the same dynahash instance. Otherwise, assertion will be triggered.
+ * HASH_ENTER will reuse element stored with HASH_REUSE if no duplicate entry
+ * found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1040,10 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long		nentries;
+
+		nentries = nalloced_read(&hctl->nalloced) - hctl->freeList[0].nfree;
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1087,11 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+	/* Check HASH_REUSE were called for same dynahash if were */
+	Assert(DynaHashReuse.element == NULL || DynaHashReuse.hashp == hashp);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1105,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1129,22 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and stash it for the following HASH_ENTER */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+				DynaHashReuse.freelist_idx = freelist_idx;
+
+				/* Caller should perform HASH_ENTER as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,14 +1153,47 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (likely(DynaHashReuse.element == NULL))
+					return (void *) ELEMENTKEY(currBucket);
+
+				freelist_idx = DynaHashReuse.freelist_idx;
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+				DynaHashReuse.freelist_idx = 0;
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
 				elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
 					 hashp->tabname);
 
-			currBucket = get_hash_entry(hashp, freelist_idx);
+			if (DynaHashReuse.element == NULL)
+			{
+				currBucket = get_hash_entry(hashp, freelist_idx);
+			}
+			else
+			{
+				currBucket = DynaHashReuse.element;
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+				DynaHashReuse.freelist_idx = 0;
+			}
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1301,7 +1395,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1346,14 +1440,11 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 				if (newElement != NULL)
 				{
+					Assert(hctl->freeList[borrow_from_idx].nfree > 0);
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1456,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
+	Assert(hctl->freeList[freelist_idx].nfree > 0);
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1474,9 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = nalloced_read(&hashp->hctl->nalloced);
+
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1486,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum -= hashp->hctl->freeList[i].nfree;
 	}
 
 	return sum;
@@ -1739,6 +1833,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	nalloced_add(&hctl->nalloced, nelem);
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..2ffcde678a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From 2a086abe8c184a88cf984b2ef2cdc732aa64f1b7 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 14 Mar 2022 17:22:26 +0300
Subject: [PATCH 3/4] fixed BufTable

Since elements are now deleted before insertion in BufferAlloc, there
is no need for excess BufTable elements, and it looks like the table
can safely be declared HASH_FIXED_SIZE.
---
 src/backend/storage/buffer/buf_table.c |  3 ++-
 src/backend/storage/buffer/freelist.c  | 13 +++----------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index c189555751e..55bb491ad05 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -63,7 +63,8 @@ InitBufTable(int size)
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
+								  HASH_FIXED_SIZE);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50f..f4733434a3b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -455,8 +455,8 @@ StrategyShmemSize(void)
 {
 	Size		size = 0;
 
-	/* size of lookup hash table ... see comment in StrategyInitialize */
-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	/* size of lookup hash table */
+	size = add_size(size, BufTableShmemSize(NBuffers));
 
 	/* size of the shared replacement strategy control block */
 	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -478,15 +478,8 @@ StrategyInitialize(bool init)
 
 	/*
 	 * Initialize the shared buffer lookup hashtable.
-	 *
-	 * Since we can't tolerate running out of lookup table entries, we must be
-	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(NBuffers);
 
 	/*
 	 * Get or create the shared strategy control block
-- 
2.35.1


From 649d69f8a3d175502f67c777688d635ba8d70d44 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Thu, 3 Mar 2022 01:14:58 +0300
Subject: [PATCH 4/4] reduce memory allocation for non-partitioned
 dynahash

A non-partitioned hash table doesn't use the 32 partitions of
HASHHDR->freeList, so allocate just a single free list.
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 4d44276e3e6..50c0e476432 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -181,18 +181,6 @@ typedef uint64 nalloced_value_t;
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -220,6 +208,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -294,7 +294,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -537,7 +537,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -546,7 +547,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -654,11 +655,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1

v8-1socket.gif (image/gif)
v8-2socket.gif (image/gif)
v8-notebook.gif (image/gif)
#40Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#39)
2 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

Thanks for the new version.

At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in


Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.

Another way to reduce the observed case is to remember the freelist_idx
for the reused entry. I didn't believe it mattered much, since entries
migrate nonetheless, but probably due to some hot buffers there is a
tendency to crowd a particular freelist.

Well, I did both. Everything looks ok.

Hmm. v8 returns the stashed element to its original partition index when
the element is *not* reused. But what I saw in the previous test runs is
the REUSE->ENTER(reuse)(->REMOVE) case. So the new version looks like it
behaves the same way as (or somehow even worse than) the previous
version: get_hash_entry continuously suffers from a lack of freelist
entries. (FWIW, attached are the test-output diffs for both master and
patched.)

master finally allocated 31 fresh elements for a 100s run.

ALLOCED: 31 ;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

RETURNED: 50806 ;; returned stashed elements
BORROWED: 33620 ;; borrowed from another freelist
REUSED: 1812664 ;; stashed
ASSIGNED: 1762377 ;; reused
(ALLOCED: 0) ;; freshly allocated

It contains a huge degradation from the frequent elogs, so the numbers
cannot be relied on naively, but they should show what is happening
well enough.

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The reversal of the preference order between fresh allocation and
borrowing from another freelist might have an effect.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

measure_master.txt (text/plain)
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940fa..ac651b98e6 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -31,7 +31,7 @@ typedef struct
 	int			id;				/* Associated buffer ID */
 } BufferLookupEnt;
 
-static HTAB *SharedBufHash;
+HTAB *SharedBufHash;
 
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d70..294516ef01 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -195,6 +195,11 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	int alloc;
+	int reuse;
+	int borrow;
+	int assign;
+	int ret;
 
 #ifdef HASH_STATISTICS
 
@@ -963,6 +968,7 @@ hash_search(HTAB *hashp,
 									   foundPtr);
 }
 
+extern HTAB *SharedBufHash;
 void *
 hash_search_with_hash_value(HTAB *hashp,
 							const void *keyPtr,
@@ -1354,6 +1360,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 					hctl->freeList[freelist_idx].nentries++;
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
 
+					if (hashp == SharedBufHash)
+						elog(LOG, "BORROWED: %d", ++hctl->borrow);
 					return newElement;
 				}
 
@@ -1363,6 +1371,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		else if (hashp == SharedBufHash)
+			elog(LOG, "ALLOCED: %d", ++hctl->alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
measure_patched.txt (text/plain)
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 55bb491ad0..029bb89f26 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -31,7 +31,7 @@ typedef struct
 	int			id;				/* Associated buffer ID */
 } BufferLookupEnt;
 
-static HTAB *SharedBufHash;
+HTAB *SharedBufHash;
 
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 50c0e47643..00159714d1 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -199,6 +199,11 @@ struct HASHHDR
 	int			nelem_alloc;	/* number of entries to allocate at once */
 	nalloced_t	nalloced;		/* number of entries allocated */
 
+	int alloc;
+	int reuse;
+	int borrow;
+	int assign;
+	int ret;
 #ifdef HASH_STATISTICS
 
 	/*
@@ -1006,6 +1011,7 @@ hash_search(HTAB *hashp,
 									   foundPtr);
 }
 
+extern HTAB *SharedBufHash;
 void *
 hash_search_with_hash_value(HTAB *hashp,
 							const void *keyPtr,
@@ -1143,6 +1149,8 @@ hash_search_with_hash_value(HTAB *hashp,
 				DynaHashReuse.hashp = hashp;
 				DynaHashReuse.freelist_idx = freelist_idx;
 
+				if (hashp == SharedBufHash)
+					elog(LOG, "REUSED: %d", ++hctl->reuse);
 				/* Caller should call HASH_ASSIGN as the very next step. */
 				return (void *) ELEMENTKEY(currBucket);
 			}
@@ -1160,6 +1168,9 @@ hash_search_with_hash_value(HTAB *hashp,
 				if (likely(DynaHashReuse.element == NULL))
 					return (void *) ELEMENTKEY(currBucket);
 
+				if (hashp == SharedBufHash)
+					elog(LOG, "RETURNED: %d", ++hctl->ret);
+
 				freelist_idx = DynaHashReuse.freelist_idx;
 				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
@@ -1191,6 +1202,13 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			else
 			{
+				if (hashp == SharedBufHash)
+				{
+					hctl->assign++;
+					elog(LOG, "ASSIGNED: %d (%d)",
+						 hctl->assign, hctl->reuse - hctl->assign);
+				}
+					
 				currBucket = DynaHashReuse.element;
 				DynaHashReuse.element = NULL;
 				DynaHashReuse.hashp = NULL;
@@ -1448,6 +1466,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
+					if (hashp == SharedBufHash)
+						elog(LOG, "BORROWED: %d", ++hctl->borrow);
 					return newElement;
 				}
 
@@ -1457,6 +1477,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		else if (hashp == SharedBufHash)
+			elog(LOG, "ALLOCED: %d", ++hctl->alloc);
+
+			
 	}
 
 	/* remove entry from freelist, decrease nfree */
#41Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#40)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:

Thanks for the new version.

At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:

On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:

I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
128kB shared buffers and I saw that get_hash_entry never takes the
!element_alloc() path and always allocates a fresh entry, then
saturates at 30 new elements allocated around the middle of a
100-second run.

Then, I tried the same with the patch, and I am surprised to see that
the rise in the number of newly allocated elements didn't stop and
went up to 511 elements by the end of the 100-second run. So I found
that my concern was valid. The change in dynahash actually
continuously/repeatedly causes a lack of free list entries. I'm not
sure how much impact it has on performance if we change
get_hash_entry to prefer other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.

Another way to reduce the observed case is to remember freelist_idx
for the reused entry. I didn't believe it mattered much since entries
migrate nonetheless, but probably due to some hot buffers there is a
tendency to crowd a particular freelist.

Well, I did both. Everything looks ok.

Hmm. v8 returns the stashed element with its original partition index
when the element is *not* reused. But what I saw in the previous test
runs is the REUSE->ENTER(reuse)(->REMOVE) case. So the new version
looks like it behaves the same way as (or somehow even worse than) the
previous version.

In the REMOVE case, v8 differs neither from master nor from the
previous version. It differs only in the RETURNED case.
Or I didn't understand what you mean :(

get_hash_entry continuously suffers from a lack of freelist
entries. (FWIW, attached are the test-output diffs for both master and
patched)

master finally allocated 31 fresh elements for a 100s run.

ALLOCED: 31 ;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

RETURNED: 50806 ;; returned stashed elements
BORROWED: 33620 ;; borrowed from another freelist
REUSED: 1812664 ;; stashed
ASSIGNED: 1762377 ;; reused
(ALLOCED: 0) ;; freshly allocated

It contains a huge degradation from the frequent elogs, so the numbers
cannot be relied on naively, but they should show what is happening
well enough.

Is there any measurable performance hit caused by borrowing?
Looks like "borrowed" happened 1.5% of the time. And that is with
128kB shared buffers, which is extremely small. (Or was it 128MB?)

Well, I think some spare entries could reduce borrowing if there is
a need. I'll test on 128MB with spare entries. If there is profit,
I'll return some, but will keep SharedBufHash fixed.

The master branch does fewer freelist manipulations, since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The reversal of the preference order between fresh allocation and
borrowing from another freelist might have an effect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

------

regards
Yura.

#42Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#41)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:

On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The reversal of the preference order between fresh allocation and
borrowing from another freelist might have an effect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

Looks like there is no issue: the old commit 2d44dee0281a1abf
behaves similarly to the new one at the moment.

I think something changed in the environment.
I remember there was maintenance downtime in the autumn.
Perhaps the kernel was updated or some sysctl tuning changed.

----

regards
Yura.

#43Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#41)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:

On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:

Thanks for the new version.

At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:

On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:

I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
128kB shared buffers and I saw that get_hash_entry never takes the
!element_alloc() path and always allocates a fresh entry, then
saturates at 30 new elements allocated around the middle of a
100-second run.

Then, I tried the same with the patch, and I am surprised to see that
the rise in the number of newly allocated elements didn't stop and
went up to 511 elements by the end of the 100-second run. So I found
that my concern was valid. The change in dynahash actually
continuously/repeatedly causes a lack of free list entries. I'm not
sure how much impact it has on performance if we change
get_hash_entry to prefer other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.

Another way to reduce the observed case is to remember freelist_idx
for the reused entry. I didn't believe it mattered much since entries
migrate nonetheless, but probably due to some hot buffers there is a
tendency to crowd a particular freelist.

Well, I did both. Everything looks ok.

Hmm. v8 returns the stashed element with its original partition index
when the element is *not* reused. But what I saw in the previous test
runs is the REUSE->ENTER(reuse)(->REMOVE) case. So the new version
looks like it behaves the same way as (or somehow even worse than) the
previous version.

In the REMOVE case, v8 differs neither from master nor from the
previous version. It differs only in the RETURNED case.
Or I didn't understand what you mean :(

get_hash_entry continuously suffers from a lack of freelist
entries. (FWIW, attached are the test-output diffs for both master and
patched)

master finally allocated 31 fresh elements for a 100s run.

ALLOCED: 31 ;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

RETURNED: 50806 ;; returned stashed elements
BORROWED: 33620 ;; borrowed from another freelist
REUSED: 1812664 ;; stashed
ASSIGNED: 1762377 ;; reused
(ALLOCED: 0) ;; freshly allocated

It contains a huge degradation from the frequent elogs, so the numbers
cannot be relied on naively, but they should show what is happening
well enough.

Is there any measurable performance hit caused by borrowing?
Looks like "borrowed" happened 1.5% of the time. And that is with
128kB shared buffers, which is extremely small. (Or was it 128MB?)

Well, I think some spare entries could reduce borrowing if there is
a need. I'll test on 128MB with spare entries. If there is profit,
I'll return some, but will keep SharedBufHash fixed.

Well, I added GetMaxBackends spare items, but I don't see a certain
profit. It is probably a bit better at 128MB shared buffers and
probably a bit worse at 1GB shared buffers (select_only at scale 100).

But that is on the old Xeon X5675. Things will probably change on more
capable hardware; I just don't have access at the moment.

The master branch does fewer freelist manipulations, since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

-----

regards
Yura

#44Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#41)
Re: BufferAlloc: don't take two simultaneous locks

At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:

Hmm. v8 returns the stashed element with its original partition index
when the element is *not* reused. But what I saw in the previous test
runs is the REUSE->ENTER(reuse)(->REMOVE) case. So the new version
looks like it behaves the same way as (or somehow even worse than) the
previous version.

In the REMOVE case, v8 differs neither from master nor from the
previous version. It differs only in the RETURNED case.
Or I didn't understand what you mean :(

In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key. v8 uses that of the old key (at the
time of HASH_REUSE). So in the "REUSE->ENTER(elem exists and
returns the stashed)" case the stashed element is returned to its
original partition. But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
from old partition) case. I suspect that the frequent freelist
starvation comes from the latter case.
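To pin down the difference, here is a tiny standalone model of the
stash logic (all names and structure in this sketch are illustrative,
not the patch's code): on a collision, v8 pushes the stashed element
back to the freelist recorded at HASH_REUSE time, where v7 would have
used the new key's freelist index.

#include <assert.h>
#include <stdio.h>

#define NLISTS 4

typedef struct Elem
{
	struct Elem *link;
} Elem;

static Elem *freelist[NLISTS];
static Elem *stash;
static int	stash_idx;			/* freelist index at HASH_REUSE time (v8) */

/* HASH_REUSE: park the evicted entry instead of freeing it */
static void
reuse(Elem *e, int old_idx)
{
	stash = e;
	stash_idx = old_idx;
}

/* HASH_ENTER: consume the stash, or push it back on a collision */
static Elem *
enter(int new_idx, int collision)
{
	Elem	   *e = stash;

	(void) new_idx;				/* v7 would have used this index below */
	stash = NULL;
	if (!collision)
		return e;				/* stash becomes the new key's entry */

	/* v8: return to the list recorded at REUSE time */
	e->link = freelist[stash_idx];
	freelist[stash_idx] = e;
	return NULL;
}

int
main(void)
{
	Elem		e = {NULL};

	reuse(&e, 1);				/* old key hashed to freelist 1 */
	assert(enter(3, 1) == NULL);	/* new key collides */
	assert(freelist[1] == &e);	/* entry landed in list 1, not 3 */
	printf("collided stash returned to freelist 1\n");
	return 0;
}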

get_hash_entry continuously suffers from a lack of freelist
entries. (FWIW, attached are the test-output diffs for both master and
patched)

master finally allocated 31 fresh elements for a 100s run.

ALLOCED: 31 ;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

RETURNED: 50806 ;; returned stashed elements
BORROWED: 33620 ;; borrowed from another freelist
REUSED: 1812664 ;; stashed
ASSIGNED: 1762377 ;; reused
(ALLOCED: 0) ;; freshly allocated

(I misunderstood that v8 modified get_hash_entry's preference between
allocation and borrowing.)

I re-ran the same check for v7 and it showed a different result.

RETURNED: 1
ALLOCED: 15
BORROWED: 0
REUSED: 505435
ASSIGNED: 505462 (-27) ## the counters are not locked.

Is there any measurable performance hit caused by borrowing?
Looks like "borrowed" happened 1.5% of the time. And that is with
128kB shared buffers, which is extremely small. (Or was it 128MB?)

It is intentionally set small to get extremely frequent buffer
replacements. The point here was that the patch actually can induce
frequent freelist starvation. And like you, I also doubt the
significance of the performance hit from that. I just was not sure.

I re-ran the same for v8 and got a result largely different from the
previous trial on the same v8.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED.

Well, I think some spare entries could reduce borrowing if there is
a need. I'll test on 128MB with spare entries. If there is profit,
I'll return some, but will keep SharedBufHash fixed.

I don't doubt the benefit of this patch. And I am now convinced
that the downside is negligible compared to the benefit.

The master branch does fewer freelist manipulations, since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The reversal of the preference order between fresh allocation and
borrowing from another freelist might have an effect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

Agreed. So I think we should go in this direction.

Here are some last comments on v8.

+ HASH_FIXED_SIZE);

Ah, now I understand that this prevents allocation of new elements.
I think this is good to do for SharedBufHash.

====
+ long nfree; /* number of free entries in the list */
HASHELEMENT *freeList; /* chain of free elements */
} FreeListData;

+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_store_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)	(long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v)	pg_atomic_fetch_add_u32((a), (uint32)(v))
====

I don't think nalloced needs to be the same width as long. On
platforms with a 32-bit long, the possible degradation, if any, from a
64-bit atomic doesn't matter anyway. So why don't we always define the
atomic as 64-bit and use the pg_atomic_* functions directly?

+ case HASH_REUSE:
+ if (currBucket != NULL)

Don't we need an assertion on (DynaHashReuse.element == NULL) here?

-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	/* size of lookup hash table */
+	size = add_size(size, BufTableShmemSize(NBuffers));

I was not sure that this is safe, but actually I didn't get "out of
shared memory". On second thought, I realized that when a dynahash
entry is stashed, BufferAlloc is always holding a buffer block, too.
So now I'm sure that this is safe.

That's all.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#45Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#44)
Re: BufferAlloc: don't take two simultaneous locks

On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:

At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:

Hmm. v8 returns the stashed element with its original partition index
when the element is *not* reused. But what I saw in the previous test
runs is the REUSE->ENTER(reuse)(->REMOVE) case. So the new version
looks like it behaves the same way as (or somehow even worse than) the
previous version.

In the REMOVE case, v8 differs neither from master nor from the
previous version. It differs only in the RETURNED case.
Or I didn't understand what you mean :(

In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key. v8 uses that of the old key (at the
time of HASH_REUSE). So in the "REUSE->ENTER(elem exists and
returns the stashed)" case the stashed element is returned to its
original partition. But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
from old partition) case. I suspect that the frequent freelist
starvation comes from the latter case.

Doubtful. By probability theory, it is unlikely that a single
partition will be overflowed too much. Therefore, neither will a
freelist.

But! With 128kB shared buffers there are just 32 buffers. With 32
entries for 32 freelist partitions, some freelist partition will
certainly have 0 entries even if all entries are in freelists.
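As a back-of-the-envelope check of that claim (an illustrative sketch,
not part of the patch): with n entries spread uniformly over n
freelists, a given freelist is empty with probability (1 - 1/n)^n,
which is roughly 1/e, so at n = 32 about a third of the freelists hold
nothing even when every entry is free.

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		n = 32.0;		/* entries == freelists == 32 */
	double		p_empty = pow(1.0 - 1.0 / n, n);	/* ~ 1/e */

	printf("P(a given freelist is empty) = %.3f\n", p_empty);
	printf("expected empty freelists     = %.1f of %.0f\n",
		   n * p_empty, n);
	return 0;
}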

get_hash_entry continuously suffers from a lack of freelist
entries. (FWIW, attached are the test-output diffs for both master and
patched)

master finally allocated 31 fresh elements for a 100s run.

ALLOCED: 31 ;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

RETURNED: 50806 ;; returned stashed elements
BORROWED: 33620 ;; borrowed from another freelist
REUSED: 1812664 ;; stashed
ASSIGNED: 1762377 ;; reused
(ALLOCED: 0) ;; freshly allocated

(I misunderstood that v8 modified get_hash_entry's preference between
allocation and borrowing.)

I re-ran the same check for v7 and it showed a different result.

RETURNED: 1
ALLOCED: 15
BORROWED: 0
REUSED: 505435
ASSIGNED: 505462 (-27) ## the counters are not locked.

Is there any measurable performance hit caused by borrowing?
Looks like "borrowed" happened 1.5% of the time. And that is with
128kB shared buffers, which is extremely small. (Or was it 128MB?)

It is intentionally set small to get extremely frequent buffer
replacements. The point here was that the patch actually can induce
frequent freelist starvation. And like you, I also doubt the
significance of the performance hit from that. I just was not sure.

I re-ran the same for v8 and got a result largely different from the
previous trial on the same v8.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

Well, I think some spare entries could reduce borrowing if there is
a need. I'll test on 128MB with spare entries. If there is profit,
I'll return some, but will keep SharedBufHash fixed.

I don't doubt the benefit of this patch. And I am now convinced
that the downside is negligible compared to the benefit.

The master branch does fewer freelist manipulations, since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The reversal of the preference order between fresh allocation and
borrowing from another freelist might have an effect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

Agreed. So I think we should go in this direction.

I've checked. Looks like something has changed on the server, since
the old master commit now behaves the same as the new one (and
differently from how it behaved in October).
I remember maintenance downtime of the server in November/December.
Probably the kernel was upgraded or some system settings were changed.

Here are some last comments on v8.

+ HASH_FIXED_SIZE);

Ah, now I understand that this prevents allocation of new elements.
I think this is good to do for SharedBufHash.

====
+ long nfree; /* number of free entries in the list */
HASHELEMENT *freeList; /* chain of free elements */
} FreeListData;

+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_store_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)       (long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v)     pg_atomic_fetch_add_u32((a), (uint32)(v))
====

I don't think nalloced needs to be the same width as long. On
platforms with a 32-bit long, the possible degradation, if any, from a
64-bit atomic doesn't matter anyway. So why don't we always define the
atomic as 64-bit and use the pg_atomic_* functions directly?

Some 32-bit platforms have no native 64-bit atomics. There they are
emulated with locks.

Native atomic read/write is quite cheap, so I don't bother with
unlocked read/write for a non-partitioned table. (And I don't know of
a platform that has sizeof(long) > 4 without also having native 64-bit
atomics.)

(Maybe I'm a bit wrong? element_alloc invokes nalloc_add, which is an
atomic increment. Could it be expensive enough to be a problem in
non-shared dynahash instances?)

If the patch sticks with pg_atomic_uint64 for nalloced, then it has to
separate read+write for the partitioned (actually shared) and
non-partitioned cases.

Well, and for a 32-bit platform, long is just enough. Why spend
another 4 bytes on each dynahash?

By the way, PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY is unfortunately
missing in port/atomics/arch-arm.h for aarch64. I'll send a patch in a
new thread.

+               case HASH_REUSE:
+                       if (currBucket != NULL)

Don't we need an assertion on (DynaHashReuse.element == NULL) here?

The common assert is higher up, on line 1094:

Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);

I thought that more accurate than duplicating it in each switch case.

-       size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+       /* size of lookup hash table */
+       size = add_size(size, BufTableShmemSize(NBuffers));

I was not sure that this is safe, but actually I didn't get "out of
shared memory". On second thought, I realized that when a dynahash
entry is stashed, BufferAlloc is always holding a buffer block, too.
So now I'm sure that this is safe.

That's all.

Thank you very much for the productive review and discussion.

regards,

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

#46Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#45)
Re: BufferAlloc: don't take two simultaneous locks

At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:

At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key. v8 uses that of the old key (at the
time of HASH_REUSE). So in the "REUSE->ENTER(elem exists and
returns the stashed)" case the stashed element is returned to its
original partition. But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
from old partition) case. I suspect that the frequent freelist
starvation comes from the latter case.

Doubtful. By probability theory, it is unlikely that a single
partition will be overflowed too much. Therefore, neither will a
freelist.

Yeah. I think so generally.

But! With 128kB shared buffers there are just 32 buffers. With 32
entries for 32 freelist partitions, some freelist partition will
certainly have 0 entries even if all entries are in freelists.

Anyway, it's an extreme condition and the starvation happens only at a
negligible ratio.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

Mmm. Doesn't matter:p

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

...

I've checked. Looks like something has changed on the server, since
the old master commit now behaves the same as the new one (and
differently from how it behaved in October).
I remember maintenance downtime of the server in November/December.
Probably the kernel was upgraded or some system settings were changed.

One thing I'm a little concerned about is that the numbers show a
steady 1-2% degradation for connection counts < 17.

I think there are two possible causes of the degradation.

1. The additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
This might cause degradation for memory-contended use.

2. The nalloced operations might cause degradation on non-shared
dynahashes? I believe they don't, but I'm not sure.

In a simple benchmark with pgbench on a laptop, dynahash
allocation (including shared and non-shared) happened about 50
times per second with 10 processes and 200 with 100 processes.
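For what it's worth, a crude single-threaded sketch to bound the
uncontended cost in question (hypothesis 2 above); it uses the
GCC/Clang __atomic builtins directly and is only an illustration, not
a claim about dynahash itself:

#include <stdio.h>
#include <time.h>

#define N 100000000UL

int
main(void)
{
	volatile unsigned long plain = 0;	/* volatile keeps the loop alive */
	unsigned long atomic = 0;
	clock_t		t0,
				t1,
				t2;

	t0 = clock();
	for (unsigned long i = 0; i < N; i++)
		plain += 1;				/* ordinary increment */
	t1 = clock();
	for (unsigned long i = 0; i < N; i++)
		__atomic_fetch_add(&atomic, 1, __ATOMIC_RELAXED);	/* atomic add */
	t2 = clock();

	printf("plain: %.2fs  atomic: %.2fs  (%lu increments each)\n",
		   (double) (t1 - t0) / CLOCKS_PER_SEC,
		   (double) (t2 - t1) / CLOCKS_PER_SEC, N);
	return plain == atomic ? 0 : 1;
}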

I don't think nalloced needs to be the same width as long. On
platforms with a 32-bit long, the possible degradation, if any, from a
64-bit atomic doesn't matter anyway. So why don't we always define the
atomic as 64-bit and use the pg_atomic_* functions directly?

Some 32-bit platforms have no native 64-bit atomics. There they are
emulated with locks.

Well, and for a 32-bit platform, long is just enough. Why spend
another 4 bytes on each dynahash?

I don't think the additional bytes matter, but emulated atomic
operations can matter. However, I'm not sure which platforms use the
fallback implementations. (x86 seems to have had __sync_fetch_and_add()
since the P4.)

My opinion in the previous mail was that if the level of degradation
caused by emulated atomic operations matters, we shouldn't use atomics
there at all, since atomic operations are not free on modern platforms
either.

In relation to 2 above, if we observe that the degradation disappears
when we (tentatively) use non-atomic operations for nalloced, we should
go back to the previous per-freelist nalloced.

I don't have access to such muscular machines, though...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#47Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#46)
1 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:

At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key. v8 uses that of the old key (at the
time of HASH_REUSE). So in the "REUSE->ENTER(elem exists and
returns the stashed)" case the stashed element is returned to its
original partition. But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
from old partition) case. I suspect that the frequent freelist
starvation comes from the latter case.

Doubtful. By probability theory, it is unlikely that a single
partition will be overflowed too much. Therefore, neither will a
freelist.

Yeah. I think so generally.

But! With 128kB shared buffers there are just 32 buffers. With 32
entries for 32 freelist partitions, some freelist partition will
certainly have 0 entries even if all entries are in freelists.

Anyway, it's an extreme condition and the starvation happens only at a
negligible ratio.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

Mmm. Doesn't matter:p

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

Strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

...

I've checked. Looks like something has changed on the server, since
the old master commit now behaves the same as the new one (and
differently from how it behaved in October).
I remember maintenance downtime of the server in November/December.
Probably the kernel was upgraded or some system settings were changed.

One thing I'm a little concerned about is that the numbers show a
steady 1-2% degradation for connection counts < 17.

I think there are two possible causes of the degradation.

1. The additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
This might cause degradation for memory-contended use.

2. The nalloced operations might cause degradation on non-shared
dynahashes? I believe they don't, but I'm not sure.

In a simple benchmark with pgbench on a laptop, dynahash
allocation (including shared and non-shared) happened about 50
times per second with 10 processes and 200 with 100 processes.

I don't think nalloced needs to be the same width as long. On
platforms with a 32-bit long, the possible degradation, if any, from a
64-bit atomic doesn't matter anyway. So why don't we always define the
atomic as 64-bit and use the pg_atomic_* functions directly?

Some 32-bit platforms have no native 64-bit atomics. There they are
emulated with locks.

Well, and for a 32-bit platform, long is just enough. Why spend
another 4 bytes on each dynahash?

I don't think the additional bytes matter, but emulated atomic
operations can matter. However, I'm not sure which platforms use the
fallback implementations. (x86 seems to have had __sync_fetch_and_add()
since the P4.)

My opinion in the previous mail was that if the level of degradation
caused by emulated atomic operations matters, we shouldn't use atomics
there at all, since atomic operations are not free on modern platforms
either.

In relation to 2 above, if we observe that the degradation disappears
when we (tentatively) use non-atomic operations for nalloced, we should
go back to the previous per-freelist nalloced.

Here is a version with nalloced being a union of the appropriate
atomic type and long.

------

regards
Yura Sokolov

Attachments:

v9-bufmgr-lock-improvements.patch (text/x-patch)
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned
so other processes cannot select it for eviction. If the tag is cleared
and the buffer is removed from the old partition, other processes will
not find it. Therefore it is safe to release the old partition lock
before acquiring the new partition lock.

Tags: lwlock_numa
---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * tag first, so we need to have exclusive lock on the old mapping
+		 * partitions.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are single pinner, we hold buffer header lock and exclusive
+	 * partition lock (if tag is valid). It means no other process can inspect
+	 * it at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release partition lock and buffer header lock. We must be
+	 * sure other backend will not use this buffer until we reuse it for new
+	 * tag. Therefore, we clear out the buffer's tag and flags and remove it
+	 * from buffer table. Also buffer remains pinned to ensure
+	 * StrategyGetBuffer will not try to reuse the buffer concurrently.
+	 *
+	 * We also reset the usage_count since any recent use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 5e0e87dbc87843fb45d39ce6855e286a23934a1c Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 2/4] [PGPRO-5616] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for
a different tag.

HASH_REUSE acts like HASH_REMOVE, but stores the element to reuse in a
static variable instead of a freelist partition. HASH_ENTER may then
use that element.

Unfortunately, FreeListData->nentries had to be manipulated even in this
case. So instead of manipulating nentries, we replace nentries with
nfree (the actual length of the free list) and nalloced (the number of
entries initially allocated for the free list). This was suggested by
Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com

Tags: lwlock_numa
---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 183 ++++++++++++++++++++++---
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 171 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If reuse flag is true, deleted entry is cached for reuse, and caller
+ * must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7dbfc90aaa..a16da37fe3d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1340,7 +1340,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1534,7 +1534,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..f774d09972c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -98,6 +98,7 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "port/atomics.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -138,8 +139,7 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,16 +147,26 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
+typedef union
+{
+#if SIZEOF_LONG == 4
+	pg_atomic_uint32 a;
+#else
+	pg_atomic_uint64 a;
+#endif
+	long		l;
+}			nalloced_t;
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -170,7 +180,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -195,6 +205,7 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	nalloced_t	nalloced;		/* number of entries allocated */
 
 #ifdef HASH_STATISTICS
 
@@ -254,6 +265,16 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+	int			freelist_idx;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -269,6 +290,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
+static void free_reused_entry(HTAB *hashp);
 static void hdefault(HTAB *hashp);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
@@ -293,6 +315,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL, 0};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -306,6 +334,42 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+static inline long
+hctl_nalloced(HASHHDR *hctl)
+{
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		return (long) pg_atomic_read_u32(&hctl->nalloced.a);
+#else
+		return (long) pg_atomic_read_u64(&hctl->nalloced.a);
+#endif
+	return hctl->nalloced.l;
+}
+
+static inline void
+hctl_nalloced_add(HASHHDR *hctl, long v)
+{
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		pg_atomic_fetch_add_u32(&hctl->nalloced.a, (int32) v);
+#else
+		pg_atomic_fetch_add_u64(&hctl->nalloced.a, (int64) v);
+#endif
+	else
+		hctl->nalloced.l += v;
+}
+
+static inline void
+hctl_nalloced_init(HASHHDR *hctl)
+{
+	hctl->nalloced.l = 0;
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		pg_atomic_init_u32(&hctl->nalloced.a, 0);
+#else
+		pg_atomic_init_u64(&hctl->nalloced.a, 0);
+#endif
+}
 
 /************************** CREATE ROUTINES **********************/
 
@@ -534,6 +598,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hctl->num_partitions = info->num_partitions;
 	}
 
+	hctl_nalloced_init(hctl);
+
 	if (flags & HASH_SEGMENT)
 	{
 		hctl->ssize = info->ssize;
@@ -932,6 +998,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores removed element in static
+ *					variable instead of free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +1011,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE were called then next dynahash operation must be HASH_ENTER
+ * on the same dynahash instance. Otherwise, assertion will be triggered.
+ * HASH_ENTER will reuse element stored with HASH_REUSE if no duplicate entry
+ * found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1073,10 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long		nentries;
+
+		nentries = hctl_nalloced(hctl) - hctl->freeList[0].nfree;
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1120,9 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1136,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1160,22 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ASSIGN */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+				DynaHashReuse.freelist_idx = freelist_idx;
+
+				/* Caller should call HASH_ASSIGN as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,7 +1184,12 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (unlikely(DynaHashReuse.element != NULL))
+					free_reused_entry(hashp);
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
@@ -1100,6 +1197,7 @@ hash_search_with_hash_value(HTAB *hashp,
 					 hashp->tabname);
 
 			currBucket = get_hash_entry(hashp, freelist_idx);
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1292,6 +1390,7 @@ hash_update_hash_key(HTAB *hashp,
  * Allocate a new hashtable entry if possible; return NULL if out of memory.
  * (Or, if the underlying space allocator throws error for out-of-memory,
  * we won't return at all.)
+ * Return element stored with HASH_REUSE if any.
  */
 static HASHBUCKET
 get_hash_entry(HTAB *hashp, int freelist_idx)
@@ -1299,9 +1398,21 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 	HASHHDR    *hctl = hashp->hctl;
 	HASHBUCKET	newElement;
 
+	if (unlikely(DynaHashReuse.element != NULL))
+	{
+		Assert(DynaHashReuse.hashp == hashp);
+
+		newElement = DynaHashReuse.element;
+		DynaHashReuse.element = NULL;
+		DynaHashReuse.hashp = NULL;
+		DynaHashReuse.freelist_idx = 0;
+
+		return newElement;
+	}
+
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1346,14 +1457,11 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 				if (newElement != NULL)
 				{
+					Assert(hctl->freeList[borrow_from_idx].nfree > 0);
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1473,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
+	Assert(hctl->freeList[freelist_idx].nfree > 0);
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1375,6 +1484,32 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 	return newElement;
 }
 
+/* Return entry stored with HASH_REUSE into appropriate freelist. */
+static void
+free_reused_entry(HTAB *hashp)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = DynaHashReuse.freelist_idx;
+
+	Assert(DynaHashReuse.hashp == hashp);
+
+	/* if partitioned, must lock to touch nfree and freeList */
+	if (IS_PARTITIONED(hctl))
+		SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+	/* add the record to the appropriate freelist. */
+	DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+	hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+	hctl->freeList[freelist_idx].nfree++;
+
+	if (IS_PARTITIONED(hctl))
+		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+	DynaHashReuse.element = NULL;
+	DynaHashReuse.hashp = NULL;
+	DynaHashReuse.freelist_idx = 0;
+}
+
 /*
  * hash_get_num_entries -- get the number of entries in a hashtable
  */
@@ -1382,7 +1517,9 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = hctl_nalloced(hashp->hctl);
+
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1529,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum -= hashp->hctl->freeList[i].nfree;
 	}
 
 	return sum;
@@ -1739,6 +1876,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl_nalloced_add(hctl, nelem);
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..2ffcde678a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From 1f3c1e78a0f8773422b0479e46031631cda30170 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 14 Mar 2022 17:22:26 +0300
Subject: [PATCH 3/4] [PGPRO-5616] fixed BufTable

Since elements are now deleted before insertion in BufferAlloc, there is
no need for excess BufTable elements. And it looks like the table can
safely be declared HASH_FIXED_SIZE.

Tags: bufmgr
---
 src/backend/storage/buffer/buf_table.c |  3 ++-
 src/backend/storage/buffer/freelist.c  | 13 +++----------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index c189555751e..55bb491ad05 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -63,7 +63,8 @@ InitBufTable(int size)
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
+								  HASH_FIXED_SIZE);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50f..f4733434a3b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -455,8 +455,8 @@ StrategyShmemSize(void)
 {
 	Size		size = 0;
 
-	/* size of lookup hash table ... see comment in StrategyInitialize */
-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	/* size of lookup hash table */
+	size = add_size(size, BufTableShmemSize(NBuffers));
 
 	/* size of the shared replacement strategy control block */
 	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -478,15 +478,8 @@ StrategyInitialize(bool init)
 
 	/*
 	 * Initialize the shared buffer lookup hashtable.
-	 *
-	 * Since we can't tolerate running out of lookup table entries, we must be
-	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(NBuffers);
 
 	/*
 	 * Get or create the shared strategy control block
-- 
2.35.1


From d62eaef77476f50c7f962c451ce22e00df24b298 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 20 Mar 2022 12:32:06 +0300
Subject: [PATCH 4/4] [PGPRO-5616] reduce memory allocation for non-partitioned
 dynahash

A non-partitioned hash table doesn't use the 32 partitions of
HASHHDR->freeList. Let's allocate just a single free list in this case.

Tags: bufmgr
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index f774d09972c..707a321ba49 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -177,18 +177,6 @@ typedef union
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -216,6 +204,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -291,7 +291,7 @@ static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void free_reused_entry(HTAB *hashp);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -570,7 +570,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -579,7 +580,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -689,11 +690,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1

#48Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#47)
6 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

Good day, Kyotaro-san.
Good day, hackers.

On Sun, 20/03/2022 at 12:38 +0300, Yura Sokolov wrote:

On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:

At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key. v8 uses that of the old key (at the
time of HASH_REUSE). So in the "REUSE->ENTER (elem exists and
returns the stashed)" case, the stashed element is returned to its
original partition. But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER (borrow
from old partition) case. I suspect that the frequent freelist
starvation comes from the latter case.

Doubtful. By probability theory, a single partition is unlikely to
overflow badly. Hence, the freelists.

Yeah. I think so generally.

But! With 128kB of shared buffers there are just 32 buffers. With 32
entries for 32 freelist partitions, some freelist partition will
certainly have 0 entries even if all entries are in the freelists.
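
(Back-of-the-envelope: dropping 32 entries uniformly at random into 32
freelists leaves any given list empty with probability (31/32)^32 ≈ 36%,
so on average about a dozen of the 32 lists hold nothing.)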

Anyway, it's an extreme condition and the starvation happens only at a
negligible ratio.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

Mmm. Doesn't matter:p

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

...

A strange thing: both the master and patched versions have higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

...

I've checked. It looks like something changed on the server, since the
old master commit now behaves the same as the new one (and differently
from how it behaved in October).
I remember maintenance downtime of the server in November/December.
Probably the kernel was upgraded or some system settings were changed.

One thing I have a little concern about is that the numbers show a
steady 1-2% degradation for connection counts < 17.

I think there are two possible causes of the degradation.

1. The additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
This might cause degradation for memory-contended use.

2. The nalloced bookkeeping might cause degradation on non-shared
dynahashes? I believe it doesn't, but I'm not sure.

In a simple benchmark with pgbench on a laptop, dynahash allocation
(shared and non-shared combined) happened about 50 times per second
with 10 processes and 200 times per second with 100 processes.

I don't think nalloced needs to be the same width as long. On
platforms with a 32-bit long, any possible degradation from 64-bit
atomics there hardly matters anyway. So couldn't we always define the
atomic as 64-bit and use the pg_atomic_* functions directly?

Some 32-bit platforms have no native 64-bit atomics; there they are
emulated with locks.

Well, and on a 32-bit platform long is just enough. Why spend another
4 bytes per dynahash?

I don't think the additional bytes matter, but emulated atomic
operations can. However, I'm not sure which platforms use those
fallback implementations. (x86 seems to have had __sync_fetch_and_add()
since the P4.)

My opinion in the previous mail was that if that level of degradation
caused by emulated atomic operations matters, we shouldn't use atomics
there at all, since atomic operations are not free even on modern
platforms.

In relation to 2 above, if we observe that the degradation disappears
by (tentatively) using non-atomic operations for nalloced, we should go
back to the previous per-freelist nalloced.

Here is a version with nalloced being a union of the appropriate atomic
type and long.

Ok, I got access to a stronger server, ran the benchmarks, found weird
things, and so here is a new version :-)

First, I found that if the table size is strictly limited to NBuffers
and declared HASH_FIXED_SIZE, then under high concurrency
get_hash_entry may not find a free entry even though one must exist.
It seems that while one process scans the free lists, other concurrent
processes "move entries around": one process fetches an entry from one
free list, another puts a new entry onto a different freelist, and the
unfortunate scanner misses it since it tests each freelist only once.
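
For example: backend A checks freelist 3 and finds it empty; just before
A reaches freelist 4, backend C grabs its last entry; meanwhile backend B
frees an entry back onto freelist 3, behind A's scan. A completes its
single pass empty-handed even though a free entry existed at every
moment.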

Second, I confirm there is a problem with freelist spreading.
If I keep the entry's original freelist_idx, then one freelist becomes
crowded. If I use the new entry's freelist_idx, then one freelist is
constantly emptied.

Third, I found that increased concurrency can hurt. When a popular block
is evicted for some reason, a thundering herd effect occurs: many
backends want to read the same block, so they evict many other buffers,
but only one insertion succeeds; the rest go to the freelist. The
evicted buffers by themselves reduce the cache hit ratio and provoke
more work. The old version resists this effect by not removing the old
buffer before the new entry is successfully inserted.

To fix these issues I made the following changes:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche (1024
vs 128).
If a backend doesn't find the buffer in the buffer table and wants to
introduce it, it first calls
LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
If the lock is acquired, it proceeds to the eviction and replacement
process. Otherwise, it waits for the lock to be released and repeats
the search.

This greatly improves performance for > 400 clients in pgbench.
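
Condensed, the lookup path becomes roughly the following (a sketch only;
found_buffer() is shorthand for the pin-and-return logic in the attached
patch):

retry:
	/* see if the block is already in the buffer pool */
	LWLockAcquire(newPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(&newTag, newHash);
	if (buf_id >= 0)
		return found_buffer(buf_id);	/* hit: pin the buffer, return it */
	LWLockRelease(newPartitionLock);

	/*
	 * Only one evicter per evict-partition proceeds to the replacement
	 * path; the rest sleep until it finishes, then re-check the buffer
	 * table, where they will likely find the page already loaded.
	 */
	if (!LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
		goto retry;
	/* ... select victim, evict, insert newTag, release both locks ... */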

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table;
- if such an entry was already there, wait for it to be filled;
- otherwise find a victim buffer and replace the dummy index with the
  new one.
Waiting was done with a shared lock on EvictPartitionLock as well.
This variant performed about the same.

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR), so the entry with the dummy
index would then have to be reliably removed.
And after all, I still need to hold EvictPartitionLock to notify
waiters.

I've tried using a ConditionVariable, but its performance was much
worse.

# Dynahash capacity and freelists.

I reverted the buffer table initialization:
- removed the HASH_FIXED_SIZE restriction introduced in the previous
  version
- returned to `NBuffers + NUM_BUFFER_PARTITIONS`.
I really think there should be more spare items, since element_alloc is
almost always called at least once (with 128MB shared_buffers). But
let's keep it as is for now.

`get_hash_entry` was changed to probe NUM_FREELISTS/4 (== 8) freelists
before falling back to `element_alloc`, and probing was changed from
linear to quadratic. This greatly reduces the number of calls to
`element_alloc`, so more shared memory is left intact. And I didn't
notice a large performance hit from it. Probably there is some, but I
think it is an adequate trade-off.
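
Quadratic probing is safe here because the number of freelists is a
power of two: stepping by 1, 2, 3, ... (triangular-number offsets) is
guaranteed to visit every list before any index repeats. A tiny
standalone check of that property (an illustration, not patch code):

#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_FREELISTS 32			/* must be a power of two */

int
main(void)
{
	bool		seen[NUM_FREELISTS];
	int			idx = 5;			/* any starting list works */
	int			d = 1;

	memset(seen, 0, sizeof(seen));
	for (int i = 0; i < NUM_FREELISTS; i++)
	{
		seen[idx] = true;
		/* the same step as in get_hash_entry() */
		idx = (idx + d++) & (NUM_FREELISTS - 1);
	}
	for (int i = 0; i < NUM_FREELISTS; i++)
		assert(seen[i]);			/* every freelist probed exactly once */
	return 0;
}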

`free_reused_entry` now returns the entry to a random freelist. This
flattens the spread of free entries, although it is not enough on its
own without the other changes (thundering herd mitigation and probing
more lists in get_hash_entry).

# Benchmarks

Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @ 2.20GHz:
18 cores per socket + hyper-threading - up to 72 virtual cores total.
turbo-boost disabled
Linux 5.10.103-1 Debian.

pgbench scale 100 simple_select + simple select with 3 keys (sql file
attached).

shared buffers 128MB & 1GB
huge_pages=on

1 socket
conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 27882 | 27738 | 32735 | 32439
2 | 54082 | 54336 | 64387 | 63846
3 | 80724 | 81079 | 96387 | 94439
5 | 134404 | 133429 | 160085 | 157399
7 | 185977 | 184502 | 219916 | 217142
17 | 335345 | 338214 | 393112 | 388796
27 | 393686 | 394948 | 447945 | 444915
53 | 572234 | 577092 | 678884 | 676493
83 | 558875 | 561689 | 669212 | 655697
107 | 553054 | 551896 | 654550 | 646010
139 | 541263 | 538354 | 641937 | 633840
163 | 532932 | 531829 | 635127 | 627600
191 | 524647 | 524442 | 626228 | 617347
211 | 521624 | 522197 | 629740 | 613143
239 | 509448 | 554894 | 652353 | 652972
271 | 468190 | 557467 | 647403 | 661348
307 | 454139 | 558694 | 642229 | 657649
353 | 446853 | 554301 | 635991 | 654571
397 | 441909 | 549822 | 625194 | 647973

1 socket 3 keys

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 16677 | 16477 | 22219 | 22030
2 | 32056 | 31874 | 43298 | 43153
3 | 48091 | 47766 | 64877 | 64600
5 | 78999 | 78609 | 105433 | 106101
7 | 108122 | 107529 | 148713 | 145343
17 | 205656 | 209010 | 272676 | 271449
27 | 252015 | 254000 | 323983 | 323499
53 | 317928 | 334493 | 446740 | 449641
83 | 299234 | 327738 | 437035 | 443113
107 | 290089 | 322025 | 430535 | 431530
139 | 277294 | 314384 | 422076 | 423606
163 | 269029 | 310114 | 416229 | 417412
191 | 257315 | 306530 | 408487 | 416170
211 | 249743 | 304278 | 404766 | 416393
239 | 243333 | 310974 | 397139 | 428167
271 | 236356 | 309215 | 389972 | 427498
307 | 229094 | 307519 | 382444 | 425891
353 | 224385 | 305366 | 375020 | 423284
397 | 218549 | 302577 | 364373 | 420846

2 sockets

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 27287 | 27631 | 32943 | 32493
2 | 52397 | 54011 | 64572 | 63596
3 | 76157 | 80473 | 93363 | 93528
5 | 127075 | 134310 | 153176 | 149984
7 | 177100 | 176939 | 216356 | 211599
17 | 379047 | 383179 | 464249 | 470351
27 | 545219 | 546706 | 664779 | 662488
53 | 728142 | 728123 | 857454 | 869407
83 | 918276 | 957722 | 1215252 | 1203443
107 | 884112 | 971797 | 1206930 | 1234606
139 | 822564 | 970920 | 1167518 | 1233230
163 | 788287 | 968248 | 1130021 | 1229250
191 | 772406 | 959344 | 1097842 | 1218541
211 | 756085 | 955563 | 1077747 | 1209489
239 | 732926 | 948855 | 1050096 | 1200878
271 | 692999 | 941722 | 1017489 | 1194012
307 | 668241 | 920478 | 994420 | 1179507
353 | 642478 | 908645 | 968648 | 1174265
397 | 617673 | 893568 | 950736 | 1173411

2 sockets 3 keys

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 16722 | 16393 | 20340 | 21813
2 | 32057 | 32009 | 39993 | 42959
3 | 46202 | 47678 | 59216 | 64374
5 | 78882 | 72002 | 98054 | 103731
7 | 103398 | 99538 | 135098 | 135828
17 | 205863 | 217781 | 293958 | 299690
27 | 283526 | 290539 | 414968 | 411219
53 | 336717 | 356130 | 460596 | 474563
83 | 307310 | 342125 | 419941 | 469989
107 | 294059 | 333494 | 405706 | 469593
139 | 278453 | 328031 | 390984 | 470553
163 | 270833 | 326457 | 384747 | 470977
191 | 259591 | 322590 | 376582 | 470335
211 | 263584 | 321263 | 375969 | 469443
239 | 257135 | 316959 | 370108 | 470904
271 | 251107 | 315393 | 365794 | 469517
307 | 246605 | 311585 | 360742 | 467566
353 | 236899 | 308581 | 353464 | 466936
397 | 249036 | 305042 | 344673 | 466842

I skipped v10 since I used it internally for the variant
"insert entry with dummy index, then search for victim".

------

regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v11-bufmgr-lock-improvements.patchtext/x-patch; charset=UTF-8; name=v11-bufmgr-lock-improvements.patchDownload
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is
pinned, so other processes cannot select it for eviction. If the tag is
cleared and the buffer is removed from its old partition, other
processes will not find it. Therefore it is safe to release the old
partition lock before acquiring the new one.

---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * tag first, so we need to have exclusive lock on the old mapping
+		 * partitions.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are single pinner, we hold buffer header lock and exclusive
+	 * partition lock (if tag is valid). It means no other process can inspect
+	 * it at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release partition lock and buffer header lock. We must be
+	 * sure other backend will not use this buffer until we reuse it for new
+	 * tag. Therefore, we clear out the buffer's tag and flags and remove it
+	 * from buffer table. Also buffer remains pinned to ensure
+	 * StrategyGetBuffer will not try to reuse the buffer concurrently.
+	 *
+	 * We also reset the usage_count since any recent use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 4e5695ec50a4ade734375ba88111a44e645e1d79 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 27 Mar 2022 14:36:58 +0300
Subject: [PATCH 2/4] prevent thundering herd and limit concurrency in bufmgr.

Benchmarks show that with a huge number of clients, concurrent evicters
that try to load the same page start to contend heavily on partition
locks and freelists.
The situation can be worse than the old behaviour of acquiring the old
and new partition locks simultaneously, since the old page was not
deleted before the new page was inserted.

This patch adds another lock tranche that prevents concurrent loading
of the same buffer.

Tags: bufmgr
---
 src/backend/storage/buffer/bufmgr.c   | 11 +++++++++++
 src/backend/storage/buffer/freelist.c |  8 ++++----
 src/backend/storage/lmgr/lwlock.c     |  5 +++++
 src/include/storage/buf_internals.h   |  5 +++++
 src/include/storage/lwlock.h          |  6 +++++-
 5 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7dbfc90aaa..4c6c57e0ea6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1107,6 +1107,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
 	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	LWLock	   *newEvictPartitionLock;	/* buffer partition lock for it */
 	BufferTag	oldTag;			/* previous identity of selected buffer */
 	uint32		oldHash;		/* hash value for oldTag */
 	LWLock	   *oldPartitionLock;	/* buffer partition lock for it */
@@ -1122,7 +1123,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* determine its hash code and partition lock ID */
 	newHash = BufTableHashCode(&newTag);
 	newPartitionLock = BufMappingPartitionLock(newHash);
+	newEvictPartitionLock = BufEvictPartitionLock(newHash);
 
+retry:
 	/* see if the block is in the buffer pool already */
 	LWLockAcquire(newPartitionLock, LW_SHARED);
 	buf_id = BufTableLookup(&newTag, newHash);
@@ -1170,6 +1173,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 */
 	LWLockRelease(newPartitionLock);
 
+	/*
+	 * Prevent "thundering herd" problem and limit concurrency.
+	 */
+	if (!LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
+		goto retry;
+
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
@@ -1380,6 +1389,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
+		LWLockRelease(newEvictPartitionLock);
 
 		*foundPtr = true;
 
@@ -1426,6 +1436,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	UnlockBufHdr(buf, buf_state);
 
 	LWLockRelease(newPartitionLock);
+	LWLockRelease(newEvictPartitionLock);
 
 	/*
 	 * Buffer contents are currently invalid.  Try to obtain the right to
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50f..36218975200 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
 	 *
 	 * Since we can't tolerate running out of lookup table entries, we must be
 	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
+	 * usage is of course NBuffers entries.  But due to concurrent access to
+	 * numerous free lists in dynahash, we can miss a free entry that moved
+	 * between free lists.  So it is better to have some spare free entries
+	 * to reduce the probability of entry allocations after server start.
 	 */
 	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8f7f1b2f7c3..08e7cb6b031 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -155,6 +155,8 @@ static const char *const BuiltinTrancheNames[] = {
 	"LockFastPath",
 	/* LWTRANCHE_BUFFER_MAPPING: */
 	"BufferMapping",
+	/* LWTRANCHE_BUFFER_EVICT: */
+	"BufferEvict",
 	/* LWTRANCHE_LOCK_MANAGER: */
 	"LockManager",
 	/* LWTRANCHE_PREDICATE_LOCK_MANAGER: */
@@ -525,6 +527,9 @@ InitializeLWLocks(void)
 	lock = MainLWLockArray + BUFFER_MAPPING_LWLOCK_OFFSET;
 	for (id = 0; id < NUM_BUFFER_PARTITIONS; id++, lock++)
 		LWLockInitialize(&lock->lock, LWTRANCHE_BUFFER_MAPPING);
+	lock = MainLWLockArray + BUFFER_EVICT_LWLOCK_OFFSET;
+	for (id = 0; id < NUM_BUFFER_EVICT_PARTITIONS; id++, lock++)
+		LWLockInitialize(&lock->lock, LWTRANCHE_BUFFER_EVICT);
 
 	/* Initialize lmgrs' LWLocks in main array */
 	lock = MainLWLockArray + LOCK_MANAGER_LWLOCK_OFFSET;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..a1bb6ce60a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -126,9 +126,14 @@ typedef struct buftag
  */
 #define BufTableHashPartition(hashcode) \
 	((hashcode) % NUM_BUFFER_PARTITIONS)
+#define BufTableEvictPartition(hashcode) \
+	((hashcode) % NUM_BUFFER_EVICT_PARTITIONS)
 #define BufMappingPartitionLock(hashcode) \
 	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
 		BufTableHashPartition(hashcode)].lock)
+#define BufEvictPartitionLock(hashcode) \
+	(&MainLWLockArray[BUFFER_EVICT_LWLOCK_OFFSET + \
+		BufTableEvictPartition(hashcode)].lock)
 #define BufMappingPartitionLockByIndex(i) \
 	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c3d5889d7b2..12960cb79f5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -81,6 +81,7 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
+#define NUM_BUFFER_EVICT_PARTITIONS  (NUM_BUFFER_PARTITIONS * 8)
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
@@ -92,8 +93,10 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
 
 /* Offsets for various chunks of preallocated lwlocks. */
 #define BUFFER_MAPPING_LWLOCK_OFFSET	NUM_INDIVIDUAL_LWLOCKS
-#define LOCK_MANAGER_LWLOCK_OFFSET		\
+#define BUFFER_EVICT_LWLOCK_OFFSET	\
 	(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
+#define LOCK_MANAGER_LWLOCK_OFFSET		\
+	(BUFFER_EVICT_LWLOCK_OFFSET + NUM_BUFFER_EVICT_PARTITIONS)
 #define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
 	(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
 #define NUM_FIXED_LWLOCKS \
@@ -179,6 +182,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_REPLICATION_SLOT_IO,
 	LWTRANCHE_LOCK_FASTPATH,
 	LWTRANCHE_BUFFER_MAPPING,
+	LWTRANCHE_BUFFER_EVICT,
 	LWTRANCHE_LOCK_MANAGER,
 	LWTRANCHE_PREDICATE_LOCK_MANAGER,
 	LWTRANCHE_PARALLEL_HASH_JOIN,
-- 
2.35.1


From 0e4ce63c9fb8c53565af1b1ef11e0c7c224663ec Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 3/4] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for
a different tag.

HASH_REUSE acts like HASH_REMOVE, but stores the element to be reused
in a static variable instead of a freelist partition, and a subsequent
HASH_ENTER may then use that element.

Unfortunately, FreeListData->nentries had to be manipulated even in this
case. So instead of manipulating nentries, we replace it with nfree (the
actual length of the free list) and nalloced (the count of allocated
entries). This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com

Also, get_hash_entry is modified to try NUM_FREELISTS/4 partitions
before falling back to the allocator. This greatly reduces the need for
shared allocation without noticeable harm to performance.

---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 271 ++++++++++++++++++-------
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 213 insertions(+), 74 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If reuse flag is true, deleted entry is cached for reuse, and caller
+ * must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4c6c57e0ea6..a5a34133d29 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1349,7 +1349,7 @@ retry:
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1545,7 +1545,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..436b6f5af41 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -98,6 +98,8 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "common/pg_prng.h"
+#include "port/atomics.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -138,8 +140,7 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,16 +148,26 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
+typedef union
+{
+#if SIZEOF_LONG == 4
+	pg_atomic_uint32 a;
+#else
+	pg_atomic_uint64 a;
+#endif
+	long		l;
+}			nalloced_t;
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -170,7 +181,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -195,6 +206,7 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	nalloced_t	nalloced;		/* number of entries allocated */
 
 #ifdef HASH_STATISTICS
 
@@ -254,6 +266,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -269,6 +290,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
+static void free_reused_entry(HTAB *hashp);
 static void hdefault(HTAB *hashp);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
@@ -293,6 +315,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -306,6 +334,42 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+static inline long
+hctl_nalloced(HASHHDR *hctl)
+{
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		return (long) pg_atomic_read_u32(&hctl->nalloced.a);
+#else
+		return (long) pg_atomic_read_u64(&hctl->nalloced.a);
+#endif
+	return hctl->nalloced.l;
+}
+
+static inline void
+hctl_nalloced_add(HASHHDR *hctl, long v)
+{
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		pg_atomic_fetch_add_u32(&hctl->nalloced.a, (int32) v);
+#else
+		pg_atomic_fetch_add_u64(&hctl->nalloced.a, (int64) v);
+#endif
+	else
+		hctl->nalloced.l += v;
+}
+
+static inline void
+hctl_nalloced_init(HASHHDR *hctl)
+{
+	hctl->nalloced.l = 0;
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		pg_atomic_init_u32(&hctl->nalloced.a, 0);
+#else
+		pg_atomic_init_u64(&hctl->nalloced.a, 0);
+#endif
+}
 
 /************************** CREATE ROUTINES **********************/
 
@@ -534,6 +598,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hctl->num_partitions = info->num_partitions;
 	}
 
+	hctl_nalloced_init(hctl);
+
 	if (flags & HASH_SEGMENT)
 	{
 		hctl->ssize = info->ssize;
@@ -932,6 +998,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores removed element in static
+ *					variable instead of free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +1011,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE was called, the next dynahash operation must be HASH_ENTER
+ * on the same dynahash instance; otherwise an assertion will be triggered.
+ * HASH_ENTER will reuse the element stored by HASH_REUSE if no duplicate
+ * entry is found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1073,10 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long		nentries;
+
+		nentries = hctl_nalloced(hctl) - hctl->freeList[0].nfree;
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1120,9 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1136,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1160,21 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ENTER */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should call HASH_ENTER as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,7 +1183,12 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (unlikely(DynaHashReuse.element != NULL))
+					free_reused_entry(hashp);
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
@@ -1100,6 +1196,7 @@ hash_search_with_hash_value(HTAB *hashp,
 					 hashp->tabname);
 
 			currBucket = get_hash_entry(hashp, freelist_idx);
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1292,87 +1389,121 @@ hash_update_hash_key(HTAB *hashp,
  * Allocate a new hashtable entry if possible; return NULL if out of memory.
  * (Or, if the underlying space allocator throws error for out-of-memory,
  * we won't return at all.)
+ * Return element stored with HASH_REUSE if any.
  */
 static HASHBUCKET
 get_hash_entry(HTAB *hashp, int freelist_idx)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	HASHBUCKET	newElement;
+	bool		allocFailed = false;
+	int			borrow_from_idx;
+	int			num_freelists;
+	int			ntries;
+	int			d;
 
-	for (;;)
+	if (unlikely(DynaHashReuse.element != NULL))
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
-		if (IS_PARTITIONED(hctl))
-			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
+		Assert(DynaHashReuse.hashp == hashp);
 
-		/* try to get an entry from the freelist */
-		newElement = hctl->freeList[freelist_idx].freeList;
+		newElement = DynaHashReuse.element;
+		DynaHashReuse.element = NULL;
+		DynaHashReuse.hashp = NULL;
 
-		if (newElement != NULL)
-			break;
+		return newElement;
+	}
 
-		if (IS_PARTITIONED(hctl))
-			SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+	num_freelists = IS_PARTITIONED(hctl) ? NUM_FREELISTS : 1;
+	ntries = num_freelists / 4 + 1;
+	borrow_from_idx = freelist_idx;
+	d = 1;
 
-		/*
-		 * No free elements in this freelist.  In a partitioned table, there
-		 * might be entries in other freelists, but to reduce contention we
-		 * prefer to first try to get another chunk of buckets from the main
-		 * shmem allocator.  If that fails, though, we *MUST* root through all
-		 * the other freelists before giving up.  There are multiple callers
-		 * that assume that they can allocate every element in the initially
-		 * requested table size, or that deleting an element guarantees they
-		 * can insert a new element, even if shared memory is entirely full.
-		 * Failing because the needed element is in a different freelist is
-		 * not acceptable.
-		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+	for (; ntries || !allocFailed; ntries--)
+	{
+		if (ntries == 0)
 		{
-			int			borrow_from_idx;
-
-			if (!IS_PARTITIONED(hctl))
-				return NULL;	/* out of memory */
-
-			/* try to borrow element from another freelist */
+			/*
+			 * No free elements in first NUM_FREELISTS/4 freelists. To reduce
+			 * contention we prefer now to try to get another chunk of buckets
+			 * from the main shmem allocator. If that fails, though, we *MUST*
+			 * loop through all the remaining freelists before giving up.
+			 * There are multiple callers that assume that they can allocate
+			 * every element in the initially requested table size, or that
+			 * deleting an element guarantees they can insert a new element,
+			 * even if shared memory is entirely full. Failing because the
+			 * needed element is in a different freelist is not acceptable.
+			 */
+			allocFailed = !element_alloc(hashp, hctl->nelem_alloc,
+										 freelist_idx);
 			borrow_from_idx = freelist_idx;
-			for (;;)
-			{
-				borrow_from_idx = (borrow_from_idx + 1) % NUM_FREELISTS;
-				if (borrow_from_idx == freelist_idx)
-					break;		/* examined all freelists, fail */
+			ntries = num_freelists;
+			d = 1;
+		}
 
-				SpinLockAcquire(&(hctl->freeList[borrow_from_idx].mutex));
-				newElement = hctl->freeList[borrow_from_idx].freeList;
+		/* if partitioned, must lock to touch nfree and freeList */
+		if (IS_PARTITIONED(hctl))
+			SpinLockAcquire(&hctl->freeList[borrow_from_idx].mutex);
 
-				if (newElement != NULL)
-				{
-					hctl->freeList[borrow_from_idx].freeList = newElement->link;
-					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
+		newElement = hctl->freeList[borrow_from_idx].freeList;
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-					return newElement;
-				}
+		if (newElement != NULL)
 
+		{
+			Assert(hctl->freeList[borrow_from_idx].nfree > 0);
+			hctl->freeList[borrow_from_idx].freeList = newElement->link;
+			hctl->freeList[borrow_from_idx].nfree--;
+			if (IS_PARTITIONED(hctl))
 				SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
-			}
 
-			/* no elements available to borrow either, so out of memory */
-			return NULL;
+			return newElement;
 		}
+
+		if (IS_PARTITIONED(hctl))
+			SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
+
+		/* Check num_freelists is power of 2 */
+		Assert((num_freelists & (num_freelists - 1)) == 0);
+		/* Quadratic probing guarantees we loop through all entries. */
+		borrow_from_idx = (borrow_from_idx + d++) & (num_freelists - 1);
+	}
+
+	return NULL;				/* out of memory */
+}
+
+/* Return entry stored with HASH_REUSE into appropriate freelist. */
+static void
+free_reused_entry(HTAB *hashp)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = 0;
+
+	Assert(DynaHashReuse.hashp == hashp);
+
+	/*
+	 * Testing shows best strategy is spread reused entry in random way.
+	 * Otherwise there is a chance for pathological case with crowding at
+	 * partition of hot element.
+	 */
+	if (IS_PARTITIONED(hctl))
+	{
+		freelist_idx = pg_prng_int32p(&pg_global_prng_state);
+		freelist_idx %= NUM_FREELISTS;
 	}
 
-	/* remove entry from freelist, bump nentries */
-	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	/* if partitioned, must lock to touch nfree and freeList */
+	if (IS_PARTITIONED(hctl))
+		SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+	/* add the record to the appropriate freelist. */
+	DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+	hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+	hctl->freeList[freelist_idx].nfree++;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
 
-	return newElement;
+	DynaHashReuse.element = NULL;
+	DynaHashReuse.hashp = NULL;
 }
 
 /*
@@ -1382,7 +1513,9 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = hctl_nalloced(hashp->hctl);
+
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1525,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum -= hashp->hctl->freeList[i].nfree;
 	}
 
 	return sum;
@@ -1739,6 +1872,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl_nalloced_add(hctl, nelem);
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index a1bb6ce60a0..c1c6fdc1e33 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -333,7 +333,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From 121d620126c441289c440ef094d82dcb80d80d17 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 20 Mar 2022 12:32:06 +0300
Subject: [PATCH 4/4] reduce memory allocation for non-partitioned dynahash

A non-partitioned hash table doesn't use the 32 partitions of HASHHDR->freeList.
Let's allocate just a single freelist in this case.

Tags: bufmgr
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 436b6f5af41..aba60109d04 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -178,18 +178,6 @@ typedef union
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -217,6 +205,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -291,7 +291,7 @@ static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void free_reused_entry(HTAB *hashp);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -570,7 +570,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -579,7 +580,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -689,11 +690,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1
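
To illustrate the allocation trick in this patch: freeList was moved to be the
last member of HASHHDR precisely so that a non-partitioned table can allocate
the header truncated right after freeList[0]. A minimal standalone sketch of
the idiom (with simplified stand-in types, not the actual PostgreSQL
structures):

	#include <stdio.h>
	#include <stdlib.h>
	#include <stddef.h>

	#define NUM_FREELISTS 32

	typedef struct
	{
		long		nfree;		/* simplified stand-in for FreeListData */
		void	   *freeList;
	} FreeListData;

	typedef struct
	{
		long		dsize;		/* stand-in for the other header fields */
		FreeListData freeList[NUM_FREELISTS];	/* must stay the last member */
	} Hdr;

	int
	main(void)
	{
		/* partitioned table: the full header with all 32 freelists */
		Hdr		   *full = malloc(sizeof(Hdr));

		/* non-partitioned: only freeList[0] is ever used, so stop after it */
		Hdr		   *slim = malloc(offsetof(Hdr, freeList[1]));

		printf("full: %zu bytes, truncated: %zu bytes\n",
			   sizeof(Hdr), offsetof(Hdr, freeList[1]));
		free(full);
		free(slim);
		return 0;
	}

Nothing in the header lives past freeList[0] in the non-partitioned case,
which is what makes the truncated allocation (and the matching MemSet in
hdefault) safe.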

Attachment: 1socket.gif (image/gif)
Attachment: 1socket3.gif (image/gif)
Attachment: 2socket.gif (image/gif)
Attachment: 2socket3.gif (image/gif)
Attachment: simple_select3.sql (application/sql)
#49Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#48)
Re: BufferAlloc: don't take two simultaneous locks

Hi, Yura.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Ok, I got access to a stronger server, did the benchmark, found weird
things, and so here is a new version :-)

Thanks for the new version and benchmarking.

First, I found that if the table size is strictly limited to NBuffers and
FIXED, then under high concurrency get_hash_entry may not find a free
entry despite one being there. It seems that while a process scans the
free lists, other concurrent processes "move entries around", i.e. one
concurrent process fetches an entry from one freelist, another process
puts a new entry in a different freelist, and the unfortunate process
misses it since it tests each freelist only once.

StrategyGetBuffer believes that entries don't move across freelists
and it was true before this patch.

Second, I confirm there is a problem with freelist spreading.
If I keep the entry's freelist_idx, then one freelist is crowded.
If I use the new entry's freelist_idx, then one freelist is emptied
constantly.

Perhaps it is what I saw before. I'm not sure about the details of
how that happens, though.

Third, I found that increased concurrency could harm. When a popular
block is evicted for some reason, a thundering herd effect occurs:
many backends want to read the same block, and they evict many other
buffers, but only one is inserted. The others go to the freelist. The
evicted buffers themselves reduce the cache hit ratio and provoke more
work. The old version resists this effect by not removing the old buffer
before the new entry is successfully inserted.

Nice finding.

To fix these issues I made the following changes:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
128).
If a backend doesn't find the buffer in the buffer table and wants to
introduce it, it first calls
LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
If the lock is acquired, it proceeds to the eviction-and-replace process.
Otherwise, it waits for the lock to be released and repeats the search.

This greatly improves performance for > 400 clients in pgbench.

So the performance difference between the existing code and v11 is that
the latter has a collision cross section eight times smaller than the
former?

+ * Prevent "thundering herd" problem and limit concurrency.

this is something like pressing the accelerator and brake pedals at the
same time. If it improves performance, wouldn't just increasing the
number of buffer partitions work?

It's also not great that follower backends run a busy loop on the
lock until the top-runner backend inserts the new buffer into the
buftable and then releases the newPartitionLock.

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table.
- if such an entry is already there, then wait for it to be filled.
- otherwise find a victim buffer and replace the dummy index with the new one.
Waiting was done with a shared lock on EvictPartitionLock as well.
This variant performed about the same.

This one looks better to me. Since a partition can be shared by two or
more new-buffers, condition variable seems to work better here...

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR). Therefore there is
a need to reliably remove the entry with the dummy index.

Perhaps UnlockBuffers can do that.

And after all, I still need to hold EvictPartitionLock to notice
waiters.
I've tried to use ConditionVariable, but its performance was much
worse.

How many CVs did you use?

# Dynahash capacity and freelists.

I restored the previous buffer table initialization:
- removed the FIXES_SIZE restriction introduced in the previous version

Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE..

- returned `NBuffers + NUM_BUFFER_PARTITIONS`.
I really think there should be more spare items, since entry_alloc is
almost always called at least once (on 128MB shared_buffers). But
let's keep it as is for now.

Maybe s/entry_alloc/element_alloc/ ? :p

I see it with shared_buffers=128kB (not MB) and pgbench -i on master.

The required number of elements is already allocated to the freelists at
hash creation. So the reason for the call is imbalanced use among the
freelists. Even in that case other freelists hold elements. So we
don't need to expand the element size.

`get_hash_entry` was changed to probe NUM_FREELISTS/4 (==8) freelists
before falling back to `entry_alloc`, and the probing was changed from
linear to quadratic. This greatly reduces the number of calls to
`entry_alloc`, so more shared memory is left intact. And I didn't notice
a large performance hit from it. Probably there is some, but I think it
is an adequate trade-off.

I don't think that causes a significant performance hit, but I don't
understand how it improves the freelist hit ratio other than by accident.
Do you have some reasoning for it?

By the way, the change to get_hash_entry looks somewhat wrong.

If I understand it correctly, it visits num_freelists/4 freelists at
once, then tries element_alloc. If element_alloc() fails (which must
happen at some point), it only tries freeList[freelist_idx] and gives
up, even though there must be an element in the other 3/4 freelists.

`free_reused_entry` now returns the entry to a random position. This
flattens the spread of free entries. Although it is not enough without
the other changes (thundering herd mitigation and probing more lists in
get_hash_entry).

If "thudering herd" means "many backends rush trying to read-in the
same page at once", isn't it avoided by the change in BufferAlloc?

I feel the random returning method might work. I want to get rid of
the randomness here but I can't come up with a better way.

Anyway, the code path is used only by the buftable, so it does no harm
in general.

# Benchmarks

# Thanks for benchmarking!!

Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @2.20GHz:
18 cores per socket + hyper-threading - up to 72 virtual cores total.
turbo-boost disabled
Linux 5.10.103-1 Debian.

pgbench scale 100 simple_select + simple select with 3 keys (sql file
attached).

shared buffers 128MB & 1GB
huge_pages=on

1 socket
conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 27882 | 27738 | 32735 | 32439
2 | 54082 | 54336 | 64387 | 63846
3 | 80724 | 81079 | 96387 | 94439
5 | 134404 | 133429 | 160085 | 157399
7 | 185977 | 184502 | 219916 | 217142

v11+128MB degrades above here..

17 | 335345 | 338214 | 393112 | 388796
27 | 393686 | 394948 | 447945 | 444915
53 | 572234 | 577092 | 678884 | 676493
83 | 558875 | 561689 | 669212 | 655697
107 | 553054 | 551896 | 654550 | 646010
139 | 541263 | 538354 | 641937 | 633840
163 | 532932 | 531829 | 635127 | 627600
191 | 524647 | 524442 | 626228 | 617347
211 | 521624 | 522197 | 629740 | 613143

v11+1GB degrades above here..

239 | 509448 | 554894 | 652353 | 652972
271 | 468190 | 557467 | 647403 | 661348
307 | 454139 | 558694 | 642229 | 657649
353 | 446853 | 554301 | 635991 | 654571
397 | 441909 | 549822 | 625194 | 647973

1 socket 3 keys

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 16677 | 16477 | 22219 | 22030
2 | 32056 | 31874 | 43298 | 43153
3 | 48091 | 47766 | 64877 | 64600
5 | 78999 | 78609 | 105433 | 106101
7 | 108122 | 107529 | 148713 | 145343

v11+128MB degrades above here..

17 | 205656 | 209010 | 272676 | 271449
27 | 252015 | 254000 | 323983 | 323499

v11+1GB degrades above here..

53 | 317928 | 334493 | 446740 | 449641
83 | 299234 | 327738 | 437035 | 443113
107 | 290089 | 322025 | 430535 | 431530
139 | 277294 | 314384 | 422076 | 423606
163 | 269029 | 310114 | 416229 | 417412
191 | 257315 | 306530 | 408487 | 416170
211 | 249743 | 304278 | 404766 | 416393
239 | 243333 | 310974 | 397139 | 428167
271 | 236356 | 309215 | 389972 | 427498
307 | 229094 | 307519 | 382444 | 425891
353 | 224385 | 305366 | 375020 | 423284
397 | 218549 | 302577 | 364373 | 420846

2 sockets

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 27287 | 27631 | 32943 | 32493
2 | 52397 | 54011 | 64572 | 63596
3 | 76157 | 80473 | 93363 | 93528
5 | 127075 | 134310 | 153176 | 149984
7 | 177100 | 176939 | 216356 | 211599
17 | 379047 | 383179 | 464249 | 470351
27 | 545219 | 546706 | 664779 | 662488
53 | 728142 | 728123 | 857454 | 869407
83 | 918276 | 957722 | 1215252 | 1203443

v11+1GB degrades above here..

107 | 884112 | 971797 | 1206930 | 1234606
139 | 822564 | 970920 | 1167518 | 1233230
163 | 788287 | 968248 | 1130021 | 1229250
191 | 772406 | 959344 | 1097842 | 1218541
211 | 756085 | 955563 | 1077747 | 1209489
239 | 732926 | 948855 | 1050096 | 1200878
271 | 692999 | 941722 | 1017489 | 1194012
307 | 668241 | 920478 | 994420 | 1179507
353 | 642478 | 908645 | 968648 | 1174265
397 | 617673 | 893568 | 950736 | 1173411

2 sockets 3 keys

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 16722 | 16393 | 20340 | 21813
2 | 32057 | 32009 | 39993 | 42959
3 | 46202 | 47678 | 59216 | 64374
5 | 78882 | 72002 | 98054 | 103731
7 | 103398 | 99538 | 135098 | 135828

v11+128MB degrades above here..

17 | 205863 | 217781 | 293958 | 299690
27 | 283526 | 290539 | 414968 | 411219
53 | 336717 | 356130 | 460596 | 474563
83 | 307310 | 342125 | 419941 | 469989
107 | 294059 | 333494 | 405706 | 469593
139 | 278453 | 328031 | 390984 | 470553
163 | 270833 | 326457 | 384747 | 470977
191 | 259591 | 322590 | 376582 | 470335
211 | 263584 | 321263 | 375969 | 469443
239 | 257135 | 316959 | 370108 | 470904
271 | 251107 | 315393 | 365794 | 469517
307 | 246605 | 311585 | 360742 | 467566
353 | 236899 | 308581 | 353464 | 466936
397 | 249036 | 305042 | 344673 | 466842

I skipped v10 since I used it internally for the variant
"insert entry with dummy index then search victim".

Up to about 15%(?) gain is great.
I'm not sure it is okay that it seems to slow down by about 1%..

Ah, I see.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#50Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#49)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:

Hi, Yura.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Ok, I got access to a stronger server, did the benchmark, found weird
things, and so here is a new version :-)

Thanks for the new version and benchmarking.

First, I found that if the table size is strictly limited to NBuffers and
FIXED, then under high concurrency get_hash_entry may not find a free
entry despite one being there. It seems that while a process scans the
free lists, other concurrent processes "move entries around", i.e. one
concurrent process fetches an entry from one freelist, another process
puts a new entry in a different freelist, and the unfortunate process
misses it since it tests each freelist only once.

StrategyGetBuffer believes that entries don't move across freelists
and it was true before this patch.

StrategyGetBuffer knows nothing about dynahash's freelist.
It knows about the buffer manager's freelist, which is not partitioned.

Second, I confirm there is a problem with freelist spreading.
If I keep the entry's freelist_idx, then one freelist is crowded.
If I use the new entry's freelist_idx, then one freelist is emptied
constantly.

Perhaps it is what I saw before. I'm not sure about the details of
how that happens, though.

Third, I found that increased concurrency could harm. When a popular
block is evicted for some reason, a thundering herd effect occurs:
many backends want to read the same block, and they evict many other
buffers, but only one is inserted. The others go to the freelist. The
evicted buffers themselves reduce the cache hit ratio and provoke more
work. The old version resists this effect by not removing the old buffer
before the new entry is successfully inserted.

Nice finding.

To fix these issues I made the following changes:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
128).
If a backend doesn't find the buffer in the buffer table and wants to
introduce it, it first calls
LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
If the lock is acquired, it proceeds to the eviction-and-replace process.
Otherwise, it waits for the lock to be released and repeats the search.

This greatly improves performance for > 400 clients in pgbench.

So the performance difference between the existing code and v11 is that
the latter has a collision cross section eight times smaller than the
former?

No. Acquiring EvictPartitionLock
1. doesn't block readers, since readers don't acquire EvictPartitionLock
2. doesn't form a "tree of lock dependency", since EvictPartitionLock is
independent of PartitionLock.

The problem with the existing code:
1. Process A locks P1 and P2
2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
At this moment P1, P2, P3, P4 and P5 are all locked and waiting
for Process A.
And readers can't read from those five partitions.

With the new code:
1. Process A locks E1 (evict partition) and locks P2,
then releases P2 and locks P1.
2. Process B tries to lock E1, waits, and retries the search.
3. Process C locks E4, locks P1, then releases P1 and locks P4.
4. Process D locks E5, locks P4, then releases P4 and locks P5.
So, there is no network of locks.
Process A doesn't block Process D at any moment:
- either A blocks C, but C doesn't block D at this moment,
- or A doesn't block C.
And readers don't see five simultaneously held locks that all depend
on a single Process A.
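
To make the new flow concrete, here is a rough sketch of the v11 lookup
path in pseudo-C (simplified; found_buffer() and the variable names are
just illustrative, the real code lives in BufferAlloc):

	for (;;)
	{
		/* ordinary lookup under the mapping partition lock */
		LWLockAcquire(newPartitionLock, LW_SHARED);
		buf_id = BufTableLookup(&newTag, newHash);
		LWLockRelease(newPartitionLock);
		if (buf_id >= 0)
			return found_buffer(buf_id);

		/*
		 * Not found: only one backend per evict partition proceeds to
		 * eviction; the others sleep until that backend releases the
		 * lock, then retry the lookup instead of evicting more victims.
		 */
		if (LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
			break;				/* we are the evictor */
	}
	/* ... pick victim, delete its old mapping, insert the new mapping ... */
	LWLockRelease(newEvictPartitionLock);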

+ * Prevent "thundering herd" problem and limit concurrency.

this is something like pressing the accelerator and brake pedals at the
same time. If it improves performance, wouldn't just increasing the
number of buffer partitions work?

To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
does improve the average case.
But it is better to cure the problem than to anesthetize it.
Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
weight of the lock network, but doesn't eliminate it.

It's also not great that follower backends run a busy loop on the
lock until the top-runner backend inserts the new buffer into the
buftable and then releases the newPartitionLock.

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table.
- if such an entry is already there, then wait for it to be filled.
- otherwise find a victim buffer and replace the dummy index with the new one.
Waiting was done with a shared lock on EvictPartitionLock as well.
This variant performed about the same.

This one looks better to me. Since a partition can be shared by two or
more new-buffers, condition variable seems to work better here...

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR). Therefore there is
a need to reliably remove the entry with the dummy index.

Perhaps UnlockBuffers can do that.

Thanks for the suggestion. I'll try to investigate and retry this
approach in the patch.

And after all, I still need to hold EvictPartitionLock to notice
waiters.
I've tried to use ConditionVariable, but its performance was much
worse.

How many CVs did you use?

I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
It doesn't matter.
It looks like the use of WaitLatch (which uses epoll) and/or a triple
SpinLockAcquire per good case (with two list traversals) is much worse
than PGSemaphoreLock (which uses a futex) and a single wait-list action.

Another possibility is that while ConditionVariable eliminates the
thundering herd effect, it doesn't limit concurrency enough... but
that's just a theory.

In reality, I'd like to try to make BufferLookupEnt->id atomic
and add an LWLock to BufferLookupEnt. I'll test it, but I doubt it could
be merged, since there is no way to initialize dynahash's entries
reliably.

# Dynahash capacity and freelists.

I restored the previous buffer table initialization:
- removed the FIXES_SIZE restriction introduced in the previous version

Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE..

v9 contains HASH_FIXED_SIZE - line 815 of patch, PATCH 3/4 "fixed BufTable".

- returned `NBuffers + NUM_BUFFER_PARTITIONS`.
I really think there should be more spare items, since entry_alloc is
almost always called at least once (on 128MB shared_buffers). But
let's keep it as is for now.

Maybe s/entry_alloc/element_alloc/ ? :p

:p yes

I see it with shared_buffers=128kB (not MB) and pgbench -i on master.

The required number of elements is already allocated to the freelists at
hash creation. So the reason for the call is imbalanced use among the
freelists. Even in that case other freelists hold elements. So we
don't need to expand the element size.

`get_hash_entry` was changed to probe NUM_FREELISTS/4 (==8) freelists
before falling back to `entry_alloc`, and the probing was changed from
linear to quadratic. This greatly reduces the number of calls to
`entry_alloc`, so more shared memory is left intact. And I didn't notice
a large performance hit from it. Probably there is some, but I think it
is an adequate trade-off.

I don't think that causes a significant performance hit, but I don't
understand how it improves the freelist hit ratio other than by accident.
Do you have some reasoning for it?

Since free_reused_entry returns the entry into a random free_list, this
probability is quite high. In tests, I see stabilisation.

By the way, the change to get_hash_entry looks somewhat wrong.

If I understand it correctly, it visits num_freelists/4 freelists at
once, then tries element_alloc. If element_alloc() fails (which must
happen at some point), it only tries freeList[freelist_idx] and gives
up, even though there must be an element in the other 3/4 freelists.

No. If element_alloc fails, it tries all NUM_FREELISTS again.
- condition: `ntries || !allocFailed`. `!allocFailed` becomes true,
so `ntries` remains.
- `ntries = num_freelists;` regardless of `allocFailed`.
Therefore, all `NUM_FREELISTS` are retried for a partitioned table.
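
For reference, the coverage claim behind the quadratic step
(borrow_from_idx = (borrow_from_idx + d++) & (num_freelists - 1)) is easy
to check with a tiny standalone program; like the Assert in the patch, it
assumes a power-of-two number of freelists:

	#include <stdio.h>

	#define NUM_FREELISTS 32	/* must be a power of two */

	int
	main(void)
	{
		int		visited[NUM_FREELISTS] = {0};
		int		idx = 5;		/* arbitrary starting freelist */
		int		d = 1;
		int		i;

		/* idx += d++ walks triangular-number offsets, hitting every slot */
		for (i = 0; i < NUM_FREELISTS; i++)
		{
			visited[idx] = 1;
			idx = (idx + d++) & (NUM_FREELISTS - 1);
		}

		for (i = 0; i < NUM_FREELISTS; i++)
			if (!visited[i])
				printf("slot %d missed!\n", i);
		printf("done\n");
		return 0;
	}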

`free_reused_entry` now returns the entry to a random position. This
flattens the spread of free entries. Although it is not enough without
the other changes (thundering herd mitigation and probing more lists in
get_hash_entry).

If "thudering herd" means "many backends rush trying to read-in the
same page at once", isn't it avoided by the change in BufferAlloc?

"thundering herd" reduces speed of entries migration a lot. But
`simple_select` benchmark is too biased: looks like btree root is
evicted from time to time. So entries are slowly migrated to of from
freelist of its partition.
Without "thundering herd" fix this migration is very fast.

I feel the random returning method might work. I want to get rid of
the randomness here but I can't come up with a better way.

Anyway, the code path is used only by the buftable, so it does no harm
in general.

# Benchmarks

# Thanks for benchmarking!!

Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @2.20GHz:
18 cores per socket + hyper-threading - up to 72 virtual cores total.
turbo-boost disabled
Linux 5.10.103-1 Debian.

pgbench scale 100 simple_select + simple select with 3 keys (sql file
attached).

shared buffers 128MB & 1GB
huge_pages=on

1 socket
conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 27882 | 27738 | 32735 | 32439
2 | 54082 | 54336 | 64387 | 63846
3 | 80724 | 81079 | 96387 | 94439
5 | 134404 | 133429 | 160085 | 157399
7 | 185977 | 184502 | 219916 | 217142

v11+128MB degrades above here..

+ 1GB?

17 | 335345 | 338214 | 393112 | 388796
27 | 393686 | 394948 | 447945 | 444915
53 | 572234 | 577092 | 678884 | 676493
83 | 558875 | 561689 | 669212 | 655697
107 | 553054 | 551896 | 654550 | 646010
139 | 541263 | 538354 | 641937 | 633840
163 | 532932 | 531829 | 635127 | 627600
191 | 524647 | 524442 | 626228 | 617347
211 | 521624 | 522197 | 629740 | 613143

v11+1GB degrades above here..

239 | 509448 | 554894 | 652353 | 652972
271 | 468190 | 557467 | 647403 | 661348
307 | 454139 | 558694 | 642229 | 657649
353 | 446853 | 554301 | 635991 | 654571
397 | 441909 | 549822 | 625194 | 647973

1 socket 3 keys

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 16677 | 16477 | 22219 | 22030
2 | 32056 | 31874 | 43298 | 43153
3 | 48091 | 47766 | 64877 | 64600
5 | 78999 | 78609 | 105433 | 106101
7 | 108122 | 107529 | 148713 | 145343

v11+128MB degrades above here..

17 | 205656 | 209010 | 272676 | 271449
27 | 252015 | 254000 | 323983 | 323499

v11+1GB degrades above here..

53 | 317928 | 334493 | 446740 | 449641
83 | 299234 | 327738 | 437035 | 443113
107 | 290089 | 322025 | 430535 | 431530
139 | 277294 | 314384 | 422076 | 423606
163 | 269029 | 310114 | 416229 | 417412
191 | 257315 | 306530 | 408487 | 416170
211 | 249743 | 304278 | 404766 | 416393
239 | 243333 | 310974 | 397139 | 428167
271 | 236356 | 309215 | 389972 | 427498
307 | 229094 | 307519 | 382444 | 425891
353 | 224385 | 305366 | 375020 | 423284
397 | 218549 | 302577 | 364373 | 420846

2 sockets

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 27287 | 27631 | 32943 | 32493
2 | 52397 | 54011 | 64572 | 63596
3 | 76157 | 80473 | 93363 | 93528
5 | 127075 | 134310 | 153176 | 149984
7 | 177100 | 176939 | 216356 | 211599
17 | 379047 | 383179 | 464249 | 470351
27 | 545219 | 546706 | 664779 | 662488
53 | 728142 | 728123 | 857454 | 869407
83 | 918276 | 957722 | 1215252 | 1203443

v11+1GB degrades above here..

107 | 884112 | 971797 | 1206930 | 1234606
139 | 822564 | 970920 | 1167518 | 1233230
163 | 788287 | 968248 | 1130021 | 1229250
191 | 772406 | 959344 | 1097842 | 1218541
211 | 756085 | 955563 | 1077747 | 1209489
239 | 732926 | 948855 | 1050096 | 1200878
271 | 692999 | 941722 | 1017489 | 1194012
307 | 668241 | 920478 | 994420 | 1179507
353 | 642478 | 908645 | 968648 | 1174265
397 | 617673 | 893568 | 950736 | 1173411

2 sockets 3 keys

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 16722 | 16393 | 20340 | 21813
2 | 32057 | 32009 | 39993 | 42959
3 | 46202 | 47678 | 59216 | 64374
5 | 78882 | 72002 | 98054 | 103731
7 | 103398 | 99538 | 135098 | 135828

v11+128MB degrades above here..

17 | 205863 | 217781 | 293958 | 299690
27 | 283526 | 290539 | 414968 | 411219
53 | 336717 | 356130 | 460596 | 474563
83 | 307310 | 342125 | 419941 | 469989
107 | 294059 | 333494 | 405706 | 469593
139 | 278453 | 328031 | 390984 | 470553
163 | 270833 | 326457 | 384747 | 470977
191 | 259591 | 322590 | 376582 | 470335
211 | 263584 | 321263 | 375969 | 469443
239 | 257135 | 316959 | 370108 | 470904
271 | 251107 | 315393 | 365794 | 469517
307 | 246605 | 311585 | 360742 | 467566
353 | 236899 | 308581 | 353464 | 466936
397 | 249036 | 305042 | 344673 | 466842

I skipped v10 since I used it internally for the variant
"insert entry with dummy index then search victim".

Up to about 15%(?) gain is great.

Up to 35% in "2 socket 3 key 1GB" case.
Up to 44% in "2 socket 1 key 128MB" case.

I'm not sure it is okay that it seems to slow down by about 1%..

Well, in fact some degradation is not reproducible.
Surprisingly, results change a bit from time to time.
I just didn't rerun the whole `master` branch benchmark again
after the v11 benchmark, since each whole test run costs me 1.5 hours.

But I confirm the regression on the "1 socket 1 key 1GB" test case
between 83 and 211 connections. It was reproducible on a
more powerful Xeon 8354H, although it was less visible.

Other fluctuations close to 1% are not reliable.
For example, sometimes I see degradation or improvement with
2GB shared buffers (and even by more than 1%). But 2GB is enough
for the whole test dataset (scale 100 pgbench is 1.5GB on disk),
therefore the modified code is not involved in the benchmark at all.
How could that be explained?
That is why I don't post the 2GB benchmark results. (yeah, I'm
cheating a bit).


Ah, I see.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#51Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#50)
Re: BufferAlloc: don't take two simultaneous locks

At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:

Hi, Yura.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Ok, I got access to a stronger server, did the benchmark, found weird
things, and so here is a new version :-)

Thanks for the new version and benchmarking.

First, I found that if the table size is strictly limited to NBuffers and
FIXED, then under high concurrency get_hash_entry may not find a free
entry despite one being there. It seems that while a process scans the
free lists, other concurrent processes "move entries around", i.e. one
concurrent process fetches an entry from one freelist, another process
puts a new entry in a different freelist, and the unfortunate process
misses it since it tests each freelist only once.

StrategyGetBuffer believes that entries don't move across freelists
and it was true before this patch.

StrategyGetBuffer knows nothing about dynahash's freelist.
It knows about the buffer manager's freelist, which is not partitioned.

Yeah, right. I meant get_hash_entry.

To fix these issues I made the following changes:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
128).
If a backend doesn't find the buffer in the buffer table and wants to
introduce it, it first calls
LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
If the lock is acquired, it proceeds to the eviction-and-replace process.
Otherwise, it waits for the lock to be released and repeats the search.

This greatly improves performance for > 400 clients in pgbench.

So the performance difference between the existing code and v11 is that
the latter has a collision cross section eight times smaller than the
former?

No. Acquiring EvictPartitionLock
1. doesn't block readers, since readers don't acquire EvictPartitionLock
2. doesn't form a "tree of lock dependency", since EvictPartitionLock is
independent of PartitionLock.

The problem with the existing code:
1. Process A locks P1 and P2
2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
At this moment P1, P2, P3, P4 and P5 are all locked and waiting
for Process A.
And readers can't read from those five partitions.

With the new code:
1. Process A locks E1 (evict partition) and locks P2,
then releases P2 and locks P1.
2. Process B tries to lock E1, waits, and retries the search.
3. Process C locks E4, locks P1, then releases P1 and locks P4.
4. Process D locks E5, locks P4, then releases P4 and locks P5.
So, there is no network of locks.
Process A doesn't block Process D at any moment:
- either A blocks C, but C doesn't block D at this moment,
- or A doesn't block C.
And readers don't see five simultaneously held locks that all depend
on a single Process A.

Thanks for the detailed explanation. I see that.

+ * Prevent "thundering herd" problem and limit concurrency.

this is something like pressing the accelerator and brake pedals at the
same time. If it improves performance, wouldn't just increasing the
number of buffer partitions work?

To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
does improve the average case.
But it is better to cure the problem than to anesthetize it.
Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
weight of the lock network, but doesn't eliminate it.

Agreed.

It's also not great that follower backends run a busy loop on the
lock until the top-runner backend inserts the new buffer into the
buftable and then releases the newPartitionLock.

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table.
- if such an entry is already there, then wait for it to be filled.
- otherwise find a victim buffer and replace the dummy index with the new one.
Waiting was done with a shared lock on EvictPartitionLock as well.
This variant performed about the same.

This one looks better to me. Since a partition can be shared by two or
more new-buffers, condition variable seems to work better here...

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR). Therefore there is
a need to reliably remove the entry with the dummy index.

Perhaps UnlockBuffers can do that.

Thanks for the suggestion. I'll try to investigate and retry this
approach in the patch.

And after all, I still need to hold EvictPartitionLock to notice
waiters.
I've tried to use ConditionVariable, but its performance was much
worse.

How many CVs did you use?

I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
It doesn't matter.
It looks like the use of WaitLatch (which uses epoll) and/or a triple
SpinLockAcquire per good case (with two list traversals) is much worse
than PGSemaphoreLock (which uses a futex) and a single wait-list action.

Sure. I unintentionally neglected the overhead of our CV
implementation. It cannot be used in such a hot path.

Another possibility is that while ConditionVariable eliminates the
thundering herd effect, it doesn't limit concurrency enough... but
that's just a theory.

In reality, I'd like to try to make BufferLookupEnt->id atomic
and add an LWLock to BufferLookupEnt. I'll test it, but I doubt it could
be merged, since there is no way to initialize dynahash's entries
reliably.

Yeah, that's what came to my mind first (but with a CV, not an
LWLock), but I gave up because of the additional size. The size of
BufferLookupEnt is 24 and sizeof(ConditionVariable) is 12. By the way,
sizeof(LWLock) is 16.. So I think we shouldn't take the per-bufentry
approach here, for the reason of additional memory usage.

I don't think that causes a significant performance hit, but I don't
understand how it improves the freelist hit ratio other than by
accident. Do you have some reasoning for it?

Since free_reused_entry returns the entry into a random free_list,
this probability is quite high. In tests, I see stabilisation.

Maybe. Doesn't it improve the efficiency if we prioritize emptied
freelists when returning an element? I tried it with an atomic_u32 to
remember empty freelists. In the uint32, each bit represents a freelist
index. I saw it eliminated calls to element_alloc. I tried to
remember a single freelist index in an atomic, but there was a case
where two freelists were emptied at once, and that led to an
element_alloc call.
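
In rough code, the idea was something like this (a minimal sketch with
hypothetical names, not the exact experiment; pg_atomic_* and
pg_rightmost_one_pos32 are the existing primitives from port/atomics.h
and port/pg_bitutils.h):

#include "port/atomics.h"
#include "port/pg_bitutils.h"

/* Hypothetical sketch: one bit per freelist, set when that list was
 * observed to be empty.  pg_atomic_init_u32(&empty_mask, 0) must run
 * at hash creation. */
static pg_atomic_uint32 empty_mask;     /* works while NUM_FREELISTS <= 32 */

static inline void
note_freelist_empty(int idx)
{
    pg_atomic_fetch_or_u32(&empty_mask, UINT32_C(1) << idx);
}

/* When returning an element, prefer a freelist that was seen empty,
 * to refill it; otherwise use the suggested one.  The mask is only
 * advisory: it may be stale by the time the list's spinlock is taken,
 * which is harmless. */
static int
choose_return_freelist(int suggested)
{
    uint32      mask = pg_atomic_read_u32(&empty_mask);

    if (mask != 0)
    {
        int         idx = pg_rightmost_one_pos32(mask);     /* lowest set bit */

        pg_atomic_fetch_and_u32(&empty_mask, ~(UINT32_C(1) << idx));
        return idx;
    }
    return suggested;
}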

By the way, the change to get_hash_entry looks somewhat wrong.

If I understand it correctly, it visits num_freelists/4 freelists at
once, then tries element_alloc. If element_alloc() fails (and that must
happen eventually), it only tries freeList[freelist_idx] and gives up,
even though there must be an element in the other 3/4 of the freelists.

No. If element_alloc fails, it tries all NUM_FREELISTS again.
- the loop condition is `ntries || !allocFailed`; once `allocFailed`
becomes true, only `ntries` keeps the loop going.
- `ntries = num_freelists;` is set regardless of `allocFailed`.
Therefore, all `NUM_FREELISTS` are retried for a partitioned table.

Ah, okay. ntries is set to num_freelists after calling element_alloc.
I think we (I?) need more comments.

By the way, why is it num_freelists / 4 + 1?

`free_reused_entry` now returns the entry to a random position. This
flattens the spread of free entries. Although it is not enough without
the other changes (thundering-herd mitigation and probing more lists
in get_hash_entry).
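
In condensed form the return path looks about like this (a simplified
sketch, not the patch verbatim; it would live inside dynahash.c, where
HASHHDR and FreeListData are defined, and the nentries accounting -
atomic in the patch - is omitted):

/* Push a reused entry onto a uniformly random freelist so free
 * entries don't pile up in the partition that frees most often. */
static void
free_reused_entry_sketch(HASHHDR *hctl, HASHELEMENT *entry)
{
    int         idx = random() % NUM_FREELISTS;

    SpinLockAcquire(&hctl->freeList[idx].mutex);
    entry->link = hctl->freeList[idx].freeList;
    hctl->freeList[idx].freeList = entry;
    SpinLockRelease(&hctl->freeList[idx].mutex);
}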

If "thudering herd" means "many backends rush trying to read-in the
same page at once", isn't it avoided by the change in BufferAlloc?

"thundering herd" reduces speed of entries migration a lot. But
`simple_select` benchmark is too biased: looks like btree root is
evicted from time to time. So entries are slowly migrated to of from
freelist of its partition.
Without "thundering herd" fix this migration is very fast.

Ah, that observation agrees with the seemingly unidirectional
migration of free entries.

I remember it has been raised on this list several times that we
might prioritize index pages in shared buffers..

A gain of up to about 15%(?) is great.

Up to 35% in "2 socket 3 key 1GB" case.
Up to 44% in "2 socket 1 key 128MB" case.

Oh, even better!

I'm not sure it is okay that it seems to slow down by about 1%..

Well, in fact some of the degradation is not reproducible.
Surprisingly, results change a bit from run to run.

Yeah.

I just didn't rerun the whole `master` branch benchmark again
after the v11 benchmark, since each full test run costs me 1.5 hours.

Thanks for the labor.

But I confirm the regression on the "1 socket 1 key 1GB" test case
between 83 and 211 connections. It was reproducible on a
more powerful Xeon 8354H, although it was less visible there.

Other fluctuations close to 1% are not reliable.

I'm glad to hear that. It is not surprising that some fluctuation
happens.

For example, sometimes I see degradation or improvement with
2GB shared buffers (and even by more than 1%). But 2GB is enough
for the whole test dataset (scale 100 pgbench is 1.5GB on disk),
therefore the modified code is not involved in the benchmark at all.
How could that be explained?
That is why I don't post 2GB benchmark results. (Yeah, I'm
cheating a bit.)

If buffer replacement doesn't happen, theoretically this patch cannot
be involved in the fluctuation. I think we can consider it measurement
error.

It might come from the placement of other variables. I have sometimes
been annoyed by such small but steady changes in performance that
persist until I recompile the whole tree. But, sorry, I don't have a
clear idea of how such performance shifts happen..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#52Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#51)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 08/04/2022 at 16:46 +0900, Kyotaro Horiguchi wrote:

At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:

Hi, Yura.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Ok, I got access to a stronger server, did the benchmark, found weird
things, and so here is a new version :-)

Thanks for the new version and benchmarking.

First I found that if the table size is strictly limited to NBuffers
and FIXED, then under high concurrency get_hash_entry may not find a
free entry even though one must be there. It seems that while a process
scans the free lists, other concurrent processes "move entries
around", i.e. one concurrent process fetches an entry from one free
list, another process puts a new entry into another freelist, and the
unfortunate process misses it since it tests each freelist only once.

StrategyGetBuffer believes that entries don't move across freelists
and it was true before this patch.

StrategyGetBuffer knows nothing about dynahash's freelists.
It knows about the buffer manager's freelist, which is not partitioned.

Yeah, right. I meant get_hash_entry.

But entries don't move.
One backend takes some entry from one freelist; another backend puts
another entry onto another freelist.

I don't think that causes a significant performance hit, but I don't
understand how it improves the freelist hit ratio other than by
accident. Do you have some reasoning for it?

Since free_reused_entry returns the entry into a random free_list,
this probability is quite high. In tests, I see stabilisation.

Maybe. Doesn't it improve the efficiency if we prioritize emptied
freelists when returning an element? I tried it with an atomic_u32 to
remember empty freelists. In the uint32, each bit represents a freelist
index. I saw it eliminated calls to element_alloc. I tried to
remember a single freelist index in an atomic, but there was a case
where two freelists were emptied at once, and that led to an
element_alloc call.

I thought about a bitmask too.
But doesn't it bring back the contention that the multiple freelists
were meant to avoid? Well, in case there are enough entries to keep it
almost always "all set", it would be effectively immutable.

By the way, the change to get_hash_entry looks somewhat wrong.

If I understand it correctly, it visits num_freelists/4 freelists at
once, then tries element_alloc. If element_alloc() fails (and that must
happen eventually), it only tries freeList[freelist_idx] and gives up,
even though there must be an element in the other 3/4 of the freelists.

No. If element_alloc fails, it tries all NUM_FREELISTS again.
- the loop condition is `ntries || !allocFailed`; once `allocFailed`
becomes true, only `ntries` keeps the loop going.
- `ntries = num_freelists;` is set regardless of `allocFailed`.
Therefore, all `NUM_FREELISTS` are retried for a partitioned table.

Ah, okay. ntries is set to num_freelists after calling element_alloc.
I think we (I?) need more comments.

By the way, why is it num_freelists / 4 + 1?

Well, num_freelists could be 1 or 32.
If num_freelists is 1 then num_freelists / 4 == 0 - not good :-)
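
For clarity, the loop shape is roughly this (a simplified
reconstruction, not the patch verbatim; pop_freelist and
alloc_new_elements are stand-ins for the real freelist pop and
element_alloc() paths, and num_freelists is shown as a field only for
illustration):

/* Probe num_freelists/4 + 1 lists starting at the home list; if none
 * had an entry, try allocating once, and whether or not that
 * succeeds, rescan all lists before giving up. */
static HASHELEMENT *
get_hash_entry_sketch(HASHHDR *hctl, int freelist_idx)
{
    int         ntries = hctl->num_freelists / 4 + 1;   /* +1 keeps it >= 1 */
    bool        allocFailed = false;
    int         idx = freelist_idx;

    while (ntries || !allocFailed)
    {
        HASHELEMENT *entry = pop_freelist(hctl, idx);   /* hypothetical */

        if (entry != NULL)
            return entry;
        idx = (idx + 1) % hctl->num_freelists;

        if (--ntries == 0 && !allocFailed)
        {
            allocFailed = !alloc_new_elements(hctl, idx);   /* wraps element_alloc() */
            ntries = hctl->num_freelists;   /* rescan every list once more */
        }
    }
    return NULL;                /* genuinely out of shared memory */
}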

------

regards

Yura Sokolov

#53Robert Haas
robertmhaas@gmail.com
In reply to: Yura Sokolov (#48)
Re: BufferAlloc: don't take two simultaneous locks

On Wed, Apr 6, 2022 at 9:17 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I skipped v10 since I used it internally for variant
"insert entry with dummy index then search victim".

Hi,

I think there's a big problem with this patch:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
  *
  * Since we can't tolerate running out of lookup table entries, we must be
  * sure to specify an adequate table size here.  The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old.  In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
  */
  InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

With the existing system, there is a hard cap on the number of hash
table entries that we can ever need: one per buffer, plus one per
partition to cover the "extra" entries that are needed while changing
buffer tags. With the patch, the number of concurrent buffer tag
changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
release the lock on the old buffer partition before acquiring the lock
on the new partition, and therefore there can be any number of
backends trying to change buffer tags at the same time. But that
means, as the comment implies, that there's no longer a hard cap on
how many hash table entries we might need. I don't think we can just
accept the risk that the hash table might try to allocate after
startup. If it tries, it might fail, because all of the extra shared
memory that we allocate at startup may already have been consumed, and
then somebody's query may randomly error out. That's not OK. It's true
that very few users are likely to be affected, because most people
won't consume the extra shared memory, and of those who do, most won't
hammer the system hard enough to cause an error.

However, I don't see us deciding that it's OK to ship something that
could randomly break just because it won't do so very often.

--
Robert Haas
EDB: http://www.enterprisedb.com

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#53)
Re: BufferAlloc: don't take two simultaneous locks

Robert Haas <robertmhaas@gmail.com> writes:

With the existing system, there is a hard cap on the number of hash
table entries that we can ever need: one per buffer, plus one per
partition to cover the "extra" entries that are needed while changing
buffer tags. With the patch, the number of concurrent buffer tag
changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
release the lock on the old buffer partition before acquiring the lock
on the new partition, and therefore there can be any number of
backends trying to change buffer tags at the same time. But that
means, as the comment implies, that there's no longer a hard cap on
how many hash table entries we might need.

I agree that "just hope it doesn't overflow" is unacceptable.
But couldn't you bound the number of extra entries as MaxBackends?

FWIW, I have extremely strong doubts about whether this patch
is safe at all. This particular problem seems resolvable though.

regards, tom lane

#55Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#54)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, Apr 14, 2022 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I agree that "just hope it doesn't overflow" is unacceptable.
But couldn't you bound the number of extra entries as MaxBackends?

Yeah, possibly ... as long as it can't happen that an operation still
counts against the limit after it's failed due to an error or
something like that.

FWIW, I have extremely strong doubts about whether this patch
is safe at all. This particular problem seems resolvable though.

Can you be any more specific?

This existing comment is surely in the running for terrible comment of the year:

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

Anybody with a little bit of C knowledge will have no difficulty
gleaning from the code which follows that we are in fact acquiring
both buffer locks, but whoever wrote this (and I think it was a very
long time ago) did not feel it necessary to explain WHY we will need
to have an exclusive lock on both the old and new mapping partitions,
or more specifically, why we must hold both of those locks
simultaneously. That's unfortunate. It is clear that we need to hold
both locks at some point, just because the hash table is partitioned,
but it is not clear why we need to hold them both simultaneously.

It seems to me that whatever hazards exist must come from the fact
that the operation is no longer fully atomic. The existing code
acquires every relevant lock, then does the work, then releases locks.
Ergo, we don't have to worry about concurrency because there basically
can't be any. Stuff could be happening at the same time in other
partitions that are entirely unrelated to what we're doing, but at the
time we touch the two partitions we care about, we're the only one
touching them. Now, if we do as proposed here, we will acquire one
lock, release it, and then take the other lock, and that means that
some operations could overlap that can't overlap today. Whatever gets
broken must get broken because of that possible overlapping, because
in the absence of concurrency, the end state is the same either way.

So ... how could things get broken by having these operations overlap
each other? The possibility that we might run out of buffer mapping
entries is one concern. I guess there's also the question of whether
the collision handling is adequate: if we fail due to a collision and
handle that by putting the buffer on the free list, is that OK? And
what if we fail midway through and the buffer doesn't end up either on
the free list or in the buffer mapping table? I think maybe that's
impossible, but I'm not 100% sure that it's impossible, and I'm not
sure how bad it would be if it did happen. A permanent "leak" of a
buffer that resulted in it becoming permanently unusable would be bad,
for sure. But all of these issues seem relatively possible to avoid
with sufficiently good coding. My intuition is that the buffer mapping
table size limit is the nastiest of the problems, and if that's
resolvable then I'm not sure what else could be a hard blocker. I'm
not saying there isn't anything, just that I don't know what it might
be.

To put all this another way, suppose that we threw out the way we do
buffer allocation today and always allocated from the freelist. If the
freelist is found to be empty, the backend wanting a buffer has to do
some kind of clock sweep to populate the freelist with >=1 buffers,
and then try again. I don't think that would be performant or fair,
because it would probably happen frequently that a buffer some backend
had just added to the free list got stolen by some other backend, but
I think it would be safe, because we already put buffers on the
freelist when relations or databases are dropped, and we allocate from
there just fine in that case. So then why isn't this safe? It's
functionally the same thing, except we (usually) skip over the
intermediate step of putting the buffer on the freelist and taking it
off again.

--
Robert Haas
EDB: http://www.enterprisedb.com

#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#55)
Re: BufferAlloc: don't take two simultaneous locks

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Apr 14, 2022 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I have extremely strong doubts about whether this patch
is safe at all. This particular problem seems resolvable though.

Can you be any more specific?

This existing comment is surely in the running for terrible comment of the year:

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

I'm pretty sure that text is mine, and I didn't really think it needed
any additional explanation, because of exactly this:

It seems to me that whatever hazards exist must come from the fact
that the operation is no longer fully atomic.

If it's not atomic, then you have to worry about what happens if you
fail partway through, or somebody else changes relevant state while
you aren't holding the lock. Maybe all those cases can be dealt with,
but it will be significantly more fragile and more complicated (and
therefore slower in isolation) than the current code. Is the gain in
potential concurrency worth it? I didn't think so at the time, and
the graphs upthread aren't doing much to convince me otherwise.

regards, tom lane

#57Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#56)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, Apr 14, 2022 at 11:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If it's not atomic, then you have to worry about what happens if you
fail partway through, or somebody else changes relevant state while
you aren't holding the lock. Maybe all those cases can be dealt with,
but it will be significantly more fragile and more complicated (and
therefore slower in isolation) than the current code. Is the gain in
potential concurrency worth it? I didn't think so at the time, and
the graphs upthread aren't doing much to convince me otherwise.

Those graphs show pretty big improvements. Maybe that's only because
what is being done is not actually safe, but it doesn't look like a
trivial effect.

--
Robert Haas
EDB: http://www.enterprisedb.com

#58Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#53)
Re: BufferAlloc: don't take two simultaneous locks

At Thu, 14 Apr 2022 11:02:33 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

It seems to me that whatever hazards exist must come from the fact
that the operation is no longer fully atomic. The existing code
acquires every relevant lock, then does the work, then releases locks.
Ergo, we don't have to worry about concurrency because there basically
can't be any. Stuff could be happening at the same time in other
partitions that are entirely unrelated to what we're doing, but at the
time we touch the two partitions we care about, we're the only one
touching them. Now, if we do as proposed here, we will acquire one
lock, release it, and then take the other lock, and that means that
some operations could overlap that can't overlap today. Whatever gets
broken must get broken because of that possible overlapping, because
in the absence of concurrency, the end state is the same either way.

So ... how could things get broken by having these operations overlap
each other? The possibility that we might run out of buffer mapping
entries is one concern. I guess there's also the question of whether
the collision handling is adequate: if we fail due to a collision and
handle that by putting the buffer on the free list, is that OK? And
what if we fail midway through and the buffer doesn't end up either on
the free list or in the buffer mapping table? I think maybe that's
impossible, but I'm not 100% sure that it's impossible, and I'm not
sure how bad it would be if it did happen. A permanent "leak" of a
buffer that resulted in it becoming permanently unusable would be bad,

The patch removes the buftable entry first; then the entry is either
inserted again or returned to the freelist. I don't understand how it
can be in both the buftable and the freelist.. What kind of trouble do
you have in mind, for example? Even if some underlying function issued
an ERROR, the result wouldn't differ from the current code. (By a
quick look, it seems to me there are only WARNINGs or PANICs.) Maybe,
to be sure that it works, we need to make sure the victim buffer is
surely isolated. It is described as follows.

* We are single pinner, we hold buffer header lock and exclusive
* partition lock (if tag is valid). It means no other process can inspect
* it at the moment.
*
* But we will release partition lock and buffer header lock. We must be
* sure other backend will not use this buffer until we reuse it for new
* tag. Therefore, we clear out the buffer's tag and flags and remove it
* from buffer table. Also buffer remains pinned to ensure
* StrategyGetBuffer will not try to reuse the buffer concurrently.

for sure. But all of these issues seem relatively possible to avoid
with sufficiently good coding. My intuition is that the buffer mapping
table size limit is the nastiest of the problems, and if that's

I believe that still no additional entries are required in the
buftable. The reason for the expansion is explained as follows.

At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

First I found that if the table size is strictly limited to NBuffers
and FIXED, then under high concurrency get_hash_entry may not find a
free entry even though one must be there. It seems that while a process
scans the free lists, other

The freelist starvation is caused by the almost single-directional
inter-freelist migration that this patch introduces. So the expansion
is not needed if we neglect the slowdown (I'm not sure how large it
is..) caused by walking through all freelists. The inter-freelist
migration will stop if we pull the HASH_REUSE feature out of dynahash.

resolvable then I'm not sure what else could be a hard blocker. I'm
not saying there isn't anything, just that I don't know what it might
be.

To put all this another way, suppose that we threw out the way we do
buffer allocation today and always allocated from the freelist. If the
freelist is found to be empty, the backend wanting a buffer has to do
some kind of clock sweep to populate the freelist with >=1 buffers,
and then try again. I don't think that would be performant or fair,
because it would probably happen frequently that a buffer some backend
had just added to the free list got stolen by some other backend, but
I think it would be safe, because we already put buffers on the
freelist when relations or databases are dropped, and we allocate from
there just fine in that case. So then why isn't this safe? It's
functionally the same thing, except we (usually) skip over the
intermediate step of putting the buffer on the freelist and taking it
off again.

So, does this get moved forward if someone (maybe Yura?) runs a
benchmark with this method?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#59Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#58)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, Apr 15, 2022 at 4:29 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The patch removes the buftable entry first; then the entry is either
inserted again or returned to the freelist. I don't understand how it
can be in both the buftable and the freelist.. What kind of trouble do
you have in mind, for example?

I'm not sure. I'm just thinking about potential dangers. I was more
worried about it ending up in neither place.

So, does this get moved forward if someone (maybe Yura?) runs a
benchmark with this method?

I think we're talking about theoretical concerns about safety here,
and you can't resolve that by benchmarking. Tom or others may have a
different view, but IMHO the issue with this patch isn't that there
are no performance benefits, but that the patch needs to be fully
safe. He and I may disagree on how likely it is that it can be made
safe, but it can be a million times faster and if it's not safe it's
still dead.

Something clearly needs to be done to plug the specific problem that I
mentioned earlier, somehow making it so we never need to grow the hash
table at runtime. If anyone can think of other such hazards those also
need to be fixed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#60Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#59)
Re: BufferAlloc: don't take two simultaneous locks

At Mon, 18 Apr 2022 09:53:42 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Fri, Apr 15, 2022 at 4:29 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The patch removes the buftable entry first; then the entry is either
inserted again or returned to the freelist. I don't understand how it
can be in both the buftable and the freelist.. What kind of trouble do
you have in mind, for example?

I'm not sure. I'm just thinking about potential dangers. I was more
worried about it ending up in neither place.

I think that is more likely to happen. But I think that could also
happen with the current code if it had exits on the way. And the
patch does not add a new exit.

So, does this get moved forward if someone (maybe Yura?) runs a
benchmark with this method?

I think we're talking about theoretical concerns about safety here,
and you can't resolve that by benchmarking. Tom or others may have a

Yeah.. I didn't mean that benchmarking resolves the concerns. I meant
that if benchmarking shows that the safer (or cleaner) way gives
sufficient gain, we can take that direction.

different view, but IMHO the issue with this patch isn't that there
are no performance benefits, but that the patch needs to be fully
safe. He and I may disagree on how likely it is that it can be made
safe, but it can be a million times faster and if it's not safe it's
still dead.

Right.

Something clearly needs to be done to plug the specific problem that I
mentioned earlier, somehow making it so we never need to grow the hash
table at runtime. If anyone can think of other such hazards those also
need to be fixed.

- Running out of buffer mapping entries?

It seems to me related to "runtime growth of the table mapping hash
table". Does the runtime growth of the hash mean that get_hash_entry
may call element_alloc even if the hash is created with a sufficient
number of elements? If so, it's not the fault of this patch. We can
search all freelists before asking element_alloc() (maybe), in
exchange for some potential temporary degradation. That being said, I
don't think it's good that we call element_alloc for shared hashes
after creation.

- Is the collision handling, which just returns the victimized buffer
to the freelist, correct?

Potentially the patch can over-victimize buffers, up to
max_connections-1 of them. Is this what you are concerned about? A way
to prevent over-victimization was raised upthread, that is, we insert
a special table mapping entry that signals "this page is going to be
available soon." before releasing newPartitionLock. This prevents
over-victimization.

- Doesn't a buffer leak or duplicate mapping happen?

This patch does not change the order of the required steps, and
there's no exit on the way (if the current code doesn't have one). No
two processes victimize the same buffer, since the victimizing steps
are protected by oldPartitionLock (and the header lock) the same as in
the current code, and no two processes insert different buffers for
the same page, since the inserting steps are protected by
newPartitionLock. No victimized buffer gets orphaned *if* that doesn't
happen with the current code. So *I* am at a loss how *I* can make it
clear that they don't happen X-( (Of course Yura might think
differently.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#61Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Robert Haas (#59)
Re: BufferAlloc: don't take two simultaneous locks

Good day, hackers.

There are some sentences from upthread to address.

Sentence one
============

With the existing system, there is a hard cap on the number of hash
table entries that we can ever need: one per buffer, plus one per
partition to cover the "extra" entries that are needed while changing
buffer tags.

As I understand it: the current shared buffers implementation doesn't
allocate entries after initialization.
(I experimented on master 6c0f9f60f1.)

Ok, then it should be safe to elog(FATAL) if the shared buffer table
needs to allocate? https://pastebin.com/x8arkEdX

(all tests were done on a database initialized with `pgbench -i -s 100`)

$ pgbench -c 1 -T 10 -P 1 -S -M prepared postgres
....
pgbench: error: client 0 script 0 aborted in command 1 query 0: FATAL: extend SharedBufHash

oops...

How many entries are allocated after start?
https://pastebin.com/c5z0d5mz
(shared_buffers = 128MB;
40/80 HT cores on EPYC 7702 (a VM on a 64/128 HT core host))

$ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 16512

$ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
...
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 20439

$ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
...
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 20541

It stabilizes at 20541
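
For reference (my arithmetic, not part of the quoted session), the
initial figure matches the InitBufTable() sizing exactly, assuming the
default 8kB block size:

    NBuffers = 128MB / 8kB = 16384
    NBuffers + NUM_BUFFER_PARTITIONS = 16384 + 128 = 16512

so the later readings of 20439 and 20541 show roughly 4000 entries
allocated beyond the startup size.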

To be honest, if we add HASH_FIXED_SIZE to SharedBufHash (in the
ShmemInitHash call) then it works, but with a noticeable performance
regression.

Moreover, I didn't notice "out of shared memory" when starting with
23 spare items instead of 128 (NUM_BUFFER_PARTITIONS).

Sentence two:
=============

With the patch, the number of concurrent buffer tag
changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
release the lock on the old buffer partition before acquiring the lock
on the new partition, and therefore there can be any number of
backends trying to change buffer tags at the same time.

Let's check.
I take the v9 branch:
- no "thundering herd" prevention yet
- "get_hash_entry" is not modified
- SharedBufHash is HASH_FIXED_SIZE (!!!)
- no spare items at all, just NBuffers. (!!!)

/messages/by-id/6e6cfb8eea5ccac8e4bc2249fe0614d9f97055ee.camel@postgrespro.ru

I noticed some "out of shared memory" errors under high connection
counts (> 350) with this version. But I claimed it is because of race
conditions in "get_hash_entry": concurrent backends may take free
entries from one slot and put them into another.
Example:
- backend A checks freeList[30] - it is empty
- backend B takes an entry from freeList[31]
- backend C puts an entry into freeList[30]
- backend A checks freeList[31] - it is empty
- backend A fails with "out of shared memory"

Let's check my claim: set NUM_FREELISTS to 1, so there is no
possible race condition in "get_hash_entry".
....
Not a single "out of shared memory" for 800 clients over 30 seconds.

(Well, in fact on this single-socket 80 HT-core EPYC I didn't get
"out of shared memory" even with NUM_FREELISTS 32. I noticed them
on a 2-socket 56 HT-core Xeon Gold.)

At the same time, the master branch has to have at least 15 spare
items with NUM_FREELISTS 1 to work without "out of shared memory"
with 800 clients for 30 seconds.

Therefore the suggested approach reduces the real need for hash
entries (when there are no races in "get_hash_entry").

If one looks into the code, one sees there is no need for spare items
in the suggested code:
- when a backend calls BufTableInsert it already has a victim buffer.
  The victim buffer either:
  - was uninitialized
  -- therefore wasn't in the hash table
  --- therefore there is a free entry for it in the freeList
  - or was just cleaned
  -- then there is a free entry stored in DynaHashReuse
  --- so there is no need for a free entry from the freeList.

And, not surprisingly, there is no huge regression from setting
NUM_FREELISTS to 1, because we usually reuse the old entry via
HASH_REUSE and rarely touch the freelist at all.

Sentence three:
===============

(not an exact citation)
- It is not atomic now, therefore it is fragile.

Well, going from "theoretical concerns" to practice, here is the new
part of the control flow (sketched in code right after the list):
- we clear the buffer (but keep it pinned)
- delete the buffer from the hash table if it was there, and store the
entry for reuse
- release the old partition lock
- acquire the new partition lock
- try to insert into the new partition
- on conflict
-- return the hash entry to some freelist
-- pin the found buffer
-- unpin the victim buffer
-- return the victim to the buffer manager's free list.
- without conflict
-- reuse the saved entry if there was one
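
In condensed code the flow is roughly (a fragment simplified from the
patch, not verbatim; error paths and HASH_REUSE bookkeeping elided):

/* Names abbreviated; the real code is BufferAlloc() in bufmgr.c.
 * BufTableInsert returns the buffer id of a pre-existing entry for
 * the tag, or -1 if it inserted ours. */
int         existing_id;

CLEAR_BUFFERTAG(buf->tag);              /* buffer stays pinned */
if (oldTagValid)
    BufTableDelete(&oldTag, oldHash);   /* entry kept aside for reuse */
LWLockRelease(oldPartitionLock);

LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
existing_id = BufTableInsert(&newTag, newHash, buf->buf_id);
if (existing_id >= 0)
{
    /* conflict: someone inserted the same tag first; pin the found
     * buffer, unpin our victim and put it on the buffer manager's
     * freelist, then use the found buffer */
}
else
{
    /* no conflict: our (possibly reused) entry now maps newTag */
}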

To get a problem, one of these actions has to fail without bringing
down the whole cluster, i.e. it has to elog(ERROR) or elog(FATAL). In
any other case the whole cluster stops anyway.

Could BufTableDelete elog(ERROR|FATAL)?
No.
(There is one elog(ERROR), but with the comment "shouldn't happen".
It really could be changed to PANIC.)

Could LWLockRelease elog(ERROR|FATAL)?
No. (The elog(ERROR, "lock is not held") case cannot be triggered
since we certainly hold the lock.)

Could LWLockAcquire elog(ERROR|FATAL)?
Well, there is `elog(ERROR, "too many LWLocks taken");`
It is not possible because we just did LWLockRelease.

Could BufTableInsert elog(ERROR|FATAL)?
There is "out of shared memory", which is avoidable with the
get_hash_entry modifications or with HASH_FIXED_SIZE + some spare items.

Could CHECK_FOR_INTERRUPTS raise something?
No: there is a single line between LWLockRelease and LWLockAcquire,
and it doesn't contain CHECK_FOR_INTERRUPTS.

Therefore there is a single fixable failure case, "out of shared
memory" (fixable by HASH_FIXED_SIZE or improvements to
"get_hash_entry").

Maybe I'm not quite right at some point. I'd be glad to learn.

---------

regards

Yura Sokolov

#62Robert Haas
robertmhaas@gmail.com
In reply to: Yura Sokolov (#61)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

$ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 16512

$ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
...
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 20439

$ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
...
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 20541

It stabilizes at 20541

Hmm. So is the existing comment incorrect? Remember, I was complaining
about this change:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
  *
  * Since we can't tolerate running out of lookup table entries, we must be
  * sure to specify an adequate table size here.  The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old.  In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
  */
  InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

Pre-patch, the comment claims that the maximum number of buffer
entries that can be simultaneously used is limited to NBuffers +
NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
size. The idea is that we normally need more than 1 entry per buffer,
but sometimes we might have 2 entries for the same buffer if we're in
the process of changing the buffer tag, because we make the new entry
before removing the old one. To change the buffer tag, we need the
buffer mapping lock for the old partition and the new one, but if both
are the same, we need only one buffer mapping lock. That means that in
the worst case, you could have a number of processes equal to
NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
between values that both fall into the same partition, and thus each
using 2 entries. Then you could have every other buffer in use and
thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
entries. Now I think you're saying we go far beyond that number, and
what I wonder is how that's possible. If the system doesn't work the
way the comment says it does, maybe we ought to start by talking about
what to do about that.

I am a bit confused by your description of having done "p
SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
HTAB and HTAB's hctl member is of type HASHHDR, which has no field
called "allocated". I thought maybe my analysis here was somehow
mistaken, so I tried the debugger, which took the same view of it that
I did:

(lldb) p SharedBufHash->hctl->allocated.value
error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
SharedBufHash->hctl->allocated.value
~~~~~~~~~~~~~~~~~~~ ^

--
Robert Haas
EDB: http://www.enterprisedb.com

#63Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Robert Haas (#62)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, 21/04/2022 at 16:24 -0400, Robert Haas wrote:

On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

$ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 16512

$ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
...
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 20439

$ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
...
$ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

$1 = 20541

It stabilizes at 20541

Hmm. So is the existing comment incorrect?

It is correct and incorrect at the same time. Logically it is correct.
And it is correct in practice if HASH_FIXED_SIZE is set for SharedBufHash
(which it is not, currently). But setting HASH_FIXED_SIZE hurts
performance with a low number of spare items.

Remember, I was complaining
about this change:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
*
* Since we can't tolerate running out of lookup table entries, we must be
* sure to specify an adequate table size here.  The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old.  In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
*/
InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

Pre-patch, the comment claims that the maximum number of buffer
entries that can be simultaneously used is limited to NBuffers +
NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
size. The idea is that we normally need more than 1 entry per buffer,
but sometimes we might have 2 entries for the same buffer if we're in
the process of changing the buffer tag, because we make the new entry
before removing the old one. To change the buffer tag, we need the
buffer mapping lock for the old partition and the new one, but if both
are the same, we need only one buffer mapping lock. That means that in
the worst case, you could have a number of processes equal to
NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
between values that both fall into the same partition, and thus each
using 2 entries. Then you could have every other buffer in use and
thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
entries. Now I think you're saying we go far beyond that number, and
what I wonder is how that's possible. If the system doesn't work the
way the comment says it does, maybe we ought to start by talking about
what to do about that.

In the master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too eagerly (as soon as it
doesn't find a free entry in the current freelist partition).
- get_hash_entry has races.
- if there is a small number of spare items (and NUM_BUFFER_PARTITIONS
is a small number) and HASH_FIXED_SIZE is set, it becomes contended and
therefore slow.

HASH_REUSE solves (for shared buffers) most of these issues. The free
list becomes a rare fallback, so HASH_FIXED_SIZE for SharedBufHash
doesn't lead to a performance hit. And with a fair number of spare
items, get_hash_entry will find a free entry despite its races.
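
In compressed form, the HASH_REUSE protocol is just this pair of calls
(HASH_REUSE is the action the patch adds to dynahash; the other names
are the existing APIs):

bool        found;

/* Detach the old entry and stash it in a per-backend slot instead of
 * pushing it onto a freelist ... */
hash_search_with_hash_value(SharedBufHash, &oldTag, oldHash,
                            HASH_REUSE, NULL);

/* ... and a later HASH_ENTER (possibly in another partition) consumes
 * the stashed entry, skipping the freelist entirely. */
hash_search_with_hash_value(SharedBufHash, &newTag, newHash,
                            HASH_ENTER, &found);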

I am a bit confused by your description of having done "p
SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
HTAB and HTAB's hctl member is of type HASHHDR, which has no field
called "allocated".

The previous letter contains links to the small patches that I used
for experiments. The link that adds "allocated" is
https://pastebin.com/c5z0d5mz

I thought maybe my analysis here was somehow
mistaken, so I tried the debugger, which took the same view of it that
I did:

(lldb) p SharedBufHash->hctl->allocated.value
error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
SharedBufHash->hctl->allocated.value
~~~~~~~~~~~~~~~~~~~ ^

-----

regards

Yura Sokolov

#64Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#63)
2 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

Btw, I've run tests on an EPYC (80 cores).

1 key per select
conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 29053 | 28959 | 26715 | 25631
2 | 53714 | 53002 | 55211 | 53699
3 | 69796 | 72100 | 72355 | 71164
5 | 118045 | 112066 | 122182 | 119825
7 | 151933 | 156298 | 162001 | 160834
17 | 344594 | 347809 | 390103 | 386676
27 | 497656 | 527313 | 587806 | 598450
53 | 732524 | 853831 | 906569 | 947050
83 | 823203 | 991415 | 1056884 | 1222530
107 | 812730 | 930175 | 1004765 | 1232307
139 | 781757 | 938718 | 995326 | 1196653
163 | 758991 | 969781 | 990644 | 1143724
191 | 774137 | 977633 | 996763 | 1210899
211 | 771856 | 973361 | 1024798 | 1187824
239 | 756925 | 940808 | 954326 | 1165303
271 | 756220 | 940508 | 970254 | 1198773
307 | 746784 | 941038 | 940369 | 1159446
353 | 710578 | 928296 | 923437 | 1189575
397 | 715352 | 915931 | 911638 | 1180688

3 keys per select

conns | master | patch-v11 | master 1G | patch-v11 1G
--------+------------+------------+------------+------------
1 | 17448 | 17104 | 18359 | 19077
2 | 30888 | 31650 | 35074 | 35861
3 | 44653 | 43371 | 47814 | 47360
5 | 69632 | 64454 | 76695 | 76208
7 | 96385 | 92526 | 107587 | 107930
17 | 195157 | 205156 | 253440 | 239740
27 | 302343 | 316768 | 386748 | 335148
53 | 334321 | 396359 | 402506 | 486341
83 | 300439 | 374483 | 408694 | 452731
107 | 302768 | 369207 | 390599 | 453817
139 | 294783 | 364885 | 379332 | 459884
163 | 272646 | 344643 | 376629 | 460839
191 | 282307 | 334016 | 363322 | 449928
211 | 275123 | 321337 | 371023 | 445246
239 | 263072 | 341064 | 356720 | 441250
271 | 271506 | 333066 | 373994 | 436481
307 | 261545 | 333489 | 348569 | 466673
353 | 255700 | 331344 | 333792 | 455430
397 | 247745 | 325712 | 326680 | 439245

Attachments:

epyc3.gif (image/gif)
epyc.gif (image/gif)
#65Robert Haas
robertmhaas@gmail.com
In reply to: Yura Sokolov (#63)
Re: BufferAlloc: don't take two simultaneous locks

On Thu, Apr 21, 2022 at 6:58 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

In the master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too eagerly (as soon as it
doesn't find a free entry in the current freelist partition).
- get_hash_entry has races.
- if there is a small number of spare items (and NUM_BUFFER_PARTITIONS
is a small number) and HASH_FIXED_SIZE is set, it becomes contended and
therefore slow.

HASH_REUSE solves (for shared buffers) most of these issues. The free
list becomes a rare fallback, so HASH_FIXED_SIZE for SharedBufHash
doesn't lead to a performance hit. And with a fair number of spare
items, get_hash_entry will find a free entry despite its races.

Hmm, I see. The idea of trying to arrange to reuse entries rather than
pushing them onto a freelist and immediately trying to take them off
again is an interesting one, and I kind of like it. But I can't
imagine that anyone would commit this patch the way you have it. It's
way too much action at a distance. If any ereport(ERROR,...) could
happen between the HASH_REUSE operation and the subsequent HASH_ENTER,
it would be disastrous, and those things are separated by multiple
levels of call stack across different modules, so mistakes would be
easy to make. If this could be made into something dynahash takes care
of internally without requiring extensive cooperation with the calling
code, I think it would very possibly be accepted.

One approach would be to have a hash_replace() call that takes two
const void * arguments, one to delete and one to insert. Then maybe
you propagate that idea upward and have, similarly, a BufTableReplace
operation that uses that, and then the bufmgr code calls
BufTableReplace instead of BufTableDelete. Maybe there are other
better ideas out there...
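
For concreteness, the call might look something like this (just a
guess at a signature; nothing like this exists in dynahash today):

/* Hypothetical: delete the entry for oldKeyPtr and insert an entry
 * for newKeyPtr as one dynahash-internal operation, reusing the
 * deleted element so no freelist round trip is needed.  Returns the
 * new entry; *foundPtr reports whether newKeyPtr already existed. */
extern void *hash_replace(HTAB *hashp,
                          const void *oldKeyPtr,
                          const void *newKeyPtr,
                          bool *foundPtr);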

--
Robert Haas
EDB: http://www.enterprisedb.com

#66Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Robert Haas (#65)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 06/05/2022 at 10:26 -0400, Robert Haas wrote:

On Thu, Apr 21, 2022 at 6:58 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

In the master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too eagerly (as soon as it
doesn't find a free entry in the current freelist partition).
- get_hash_entry has races.
- if there is a small number of spare items (and NUM_BUFFER_PARTITIONS
is a small number) and HASH_FIXED_SIZE is set, it becomes contended and
therefore slow.

HASH_REUSE solves (for shared buffers) most of these issues. The free
list becomes a rare fallback, so HASH_FIXED_SIZE for SharedBufHash
doesn't lead to a performance hit. And with a fair number of spare
items, get_hash_entry will find a free entry despite its races.

Hmm, I see. The idea of trying to arrange to reuse entries rather than
pushing them onto a freelist and immediately trying to take them off
again is an interesting one, and I kind of like it. But I can't
imagine that anyone would commit this patch the way you have it. It's
way too much action at a distance. If any ereport(ERROR,...) could
happen between the HASH_REUSE operation and the subsequent HASH_ENTER,
it would be disastrous, and those things are separated by multiple
levels of call stack across different modules, so mistakes would be
easy to make. If this could be made into something dynahash takes care
of internally without requiring extensive cooperation with the calling
code, I think it would very possibly be accepted.

One approach would be to have a hash_replace() call that takes two
const void * arguments, one to delete and one to insert. Then maybe
you propagate that idea upward and have, similarly, a BufTableReplace
operation that uses that, and then the bufmgr code calls
BufTableReplace instead of BufTableDelete. Maybe there are other
better ideas out there...

No.

While HASH_REUSE is a good addition to the overall performance
improvement of the patch, it is not required for the major gain.

The major gain comes from not taking two partition locks simultaneously.

hash_replace would require two locks, so it is not an option.

regards

-----

Yura

#67Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#66)
7 attachment(s)
Re: BufferAlloc: don't take two simultaneous locks

Good day, hackers.

This is a continuation of the BufferAlloc saga.

This time I've tried to implement the following approach:
- if there's no buffer, insert a placeholder
- then find a victim
- if another backend wants to insert the same buffer, it waits on a
ConditionVariable.

The patch makes a separate ConditionVariable per backend, and the
placeholder contains the backend id. So waiters don't suffer from
collisions on the partition; they wait exactly for the concrete buffer.

This patch doesn't contain any dynahash changes since the order of
operations doesn't change: "insert then delete". So there is no way
to "reserve" an entry.

But it contains changes to ConditionVariable:

- adds ConditionVariableSleepOnce, which doesn't reinsert the process
back onto the CV's proclist.
This method cannot be used in a loop like ConditionVariableSleep,
and ConditionVariablePrepareSleep must be called beforehand.

- adds ConditionVariableBroadcastFast - an improvement over the regular
ConditionVariableBroadcast that wakes processes in batches.
So CVBroadcastFast doesn't acquire/release the CV's spinlock mutex for
every proclist entry, but rather once per batch of entries.

I believe it could safely replace ConditionVariableBroadcast, though
I haven't yet tried the replacement and checked.
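
Roughly, the batching idea looks like this (a simplified sketch, not
the patch itself; the real version still needs the sentinel trick that
the existing ConditionVariableBroadcast uses to avoid waking processes
that enqueue after the broadcast starts):

#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/proclist.h"
#include "storage/spin.h"

#define CV_WAKE_BATCH 16

/* Detach up to CV_WAKE_BATCH waiters per spinlock hold, then set
 * their latches with the spinlock already released, repeating until
 * the wait list is drained. */
static void
ConditionVariableBroadcastFastSketch(ConditionVariable *cv)
{
    PGPROC     *batch[CV_WAKE_BATCH];
    int         n;

    do
    {
        n = 0;
        SpinLockAcquire(&cv->mutex);
        while (n < CV_WAKE_BATCH && !proclist_is_empty(&cv->wakeup))
            batch[n++] = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
        SpinLockRelease(&cv->mutex);

        for (int i = 0; i < n; i++)
            SetLatch(&batch[i]->procLatch);
    } while (n == CV_WAKE_BATCH);
}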

Tests:
- tests were done on a 2-socket Xeon 5220 2.20GHz with turbo boost
disabled (i.e. max frequency is 2.20GHz)
- runs on 1 socket or 2 sockets using numactl
- pgbench scale 100 - 1.5GB of data
- shared_buffers: 128MB, 1GB (and 2GB)
- variations of simple_select with 1 key per query, 3 keys per query,
and 10 keys per query.

1 socket 1 key

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 25670 | 24926 | 29491 | 28858
2 | 50157 | 48894 | 58356 | 57180
3 | 75036 | 72904 | 87152 | 84869
5 | 124479 | 120720 | 143550 | 140799
7 | 168586 | 164277 | 199360 | 195578
17 | 319943 | 314010 | 364963 | 358550
27 | 423617 | 420528 | 491493 | 485139
53 | 491357 | 490994 | 574477 | 571753
83 | 487029 | 486750 | 571057 | 566335
107 | 478429 | 479862 | 565471 | 560115
139 | 467953 | 469981 | 556035 | 551056
163 | 459467 | 463272 | 548976 | 543660
191 | 448420 | 456105 | 540881 | 534556
211 | 440229 | 458712 | 545195 | 535333
239 | 431754 | 471373 | 547111 | 552591
271 | 421767 | 473479 | 544014 | 557910
307 | 408234 | 474285 | 539653 | 556629
353 | 389360 | 472491 | 534719 | 554696
397 | 377063 | 471513 | 527887 | 554383

1 socket 3 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 15277 | 14917 | 20109 | 19564
2 | 29587 | 28892 | 39430 | 36986
3 | 44204 | 43198 | 58993 | 57196
5 | 71471 | 68703 | 96923 | 92497
7 | 98823 | 97823 | 133173 | 130134
17 | 201351 | 198865 | 258139 | 254702
27 | 254959 | 255503 | 338117 | 339044
53 | 277048 | 291923 | 384300 | 390812
83 | 251486 | 287247 | 376170 | 385302
107 | 232037 | 281922 | 365585 | 380532
139 | 210478 | 276544 | 352430 | 373815
163 | 193875 | 271842 | 341636 | 368034
191 | 179544 | 267033 | 334408 | 362985
211 | 172837 | 269329 | 330287 | 366478
239 | 162647 | 272046 | 322646 | 371807
271 | 153626 | 271423 | 314017 | 371062
307 | 144122 | 270540 | 305358 | 370462
353 | 129544 | 268239 | 292867 | 368162
397 | 123430 | 267112 | 284394 | 366845

1 socket 10 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 6824 | 6735 | 10475 | 10220
2 | 13037 | 12628 | 20382 | 19849
3 | 19416 | 19043 | 30369 | 29554
5 | 31756 | 30657 | 49402 | 48614
7 | 42794 | 42179 | 67526 | 65071
17 | 91443 | 89772 | 139630 | 139929
27 | 107751 | 110689 | 165996 | 169955
53 | 97128 | 120621 | 157670 | 184382
83 | 82344 | 117814 | 142380 | 183863
107 | 70764 | 115841 | 134266 | 182426
139 | 57561 | 112528 | 125090 | 180121
163 | 50490 | 110443 | 119932 | 178453
191 | 45143 | 108583 | 114690 | 175899
211 | 42375 | 107604 | 111444 | 174109
239 | 39861 | 106702 | 106253 | 172410
271 | 37398 | 105819 | 102260 | 170792
307 | 35279 | 105355 | 97164 | 168313
353 | 33427 | 103537 | 91629 | 166232
397 | 31778 | 101793 | 87230 | 164381

2 sockets 1 key

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 24839 | 24386 | 29246 | 28361
2 | 46655 | 45265 | 55942 | 54327
3 | 69278 | 68332 | 83984 | 81608
5 | 115263 | 112746 | 139012 | 135426
7 | 159881 | 155119 | 193846 | 188399
17 | 373808 | 365085 | 456463 | 441603
27 | 503663 | 495443 | 600335 | 584741
53 | 708849 | 744274 | 900923 | 908488
83 | 593053 | 862003 | 985953 | 1038033
107 | 431806 | 875704 | 957115 | 1075172
139 | 328380 | 879890 | 881652 | 1069872
163 | 288339 | 874792 | 824619 | 1064047
191 | 255666 | 870532 | 790583 | 1061124
211 | 241230 | 865975 | 764898 | 1058473
239 | 227344 | 857825 | 732353 | 1049745
271 | 216095 | 848240 | 703729 | 1043182
307 | 206978 | 833980 | 674711 | 1031533
353 | 198426 | 803830 | 633783 | 1018479
397 | 191617 | 744466 | 599170 | 1006134

2 sockets 3 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 14688 | 14088 | 18912 | 18905
2 | 26759 | 25925 | 36817 | 35924
3 | 40002 | 38658 | 54765 | 53266
5 | 63479 | 63041 | 90521 | 87496
7 | 88561 | 87101 | 123425 | 121877
17 | 199411 | 196932 | 289555 | 282146
27 | 270121 | 275950 | 386884 | 383019
53 | 202918 | 374848 | 395967 | 501648
83 | 149599 | 363623 | 335815 | 478628
107 | 126501 | 348125 | 311617 | 472473
139 | 106091 | 331350 | 279843 | 466408
163 | 95497 | 321978 | 260884 | 461688
191 | 87427 | 312815 | 241189 | 458252
211 | 82783 | 307261 | 231435 | 454327
239 | 78930 | 299661 | 219655 | 451826
271 | 74081 | 294233 | 211555 | 448412
307 | 71352 | 288133 | 202838 | 446143
353 | 67872 | 279948 | 193354 | 441929
397 | 66178 | 275784 | 185556 | 438330

2 sockets 10 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 6200 | 6108 | 10163 | 9563
2 | 11196 | 10871 | 18373 | 17827
3 | 16479 | 16129 | 26807 | 26584
5 | 26750 | 26241 | 44291 | 43409
7 | 36501 | 35433 | 60508 | 59379
17 | 77320 | 77451 | 130413 | 128452
27 | 91833 | 105643 | 147259 | 156833
53 | 57138 | 115793 | 119306 | 150647
83 | 44435 | 108850 | 105454 | 148006
107 | 38031 | 105199 | 95108 | 146162
139 | 31697 | 101096 | 84011 | 143281
163 | 28826 | 98255 | 78411 | 141375
191 | 26223 | 96224 | 74256 | 139646
211 | 24933 | 94815 | 71542 | 137834
239 | 23626 | 92849 | 69289 | 137235
271 | 22664 | 90938 | 66431 | 136080
307 | 21691 | 89358 | 64661 | 133166
353 | 20712 | 88239 | 61619 | 133339
397 | 20374 | 86708 | 58937 | 130684

Well, as you can see, there is some regression at low connection
counts. I don't get where it comes from.

Moreover, it is present even in the case of 2GB shared buffers - when
all data fits into the buffer cache and the new code doesn't run at
all. (Except for this incomprehensible regression there is no
difference in performance with 2GB shared buffers.)

For example 2GB shared buffers 1 socket 3 keys:
conns | master 2G | v12 2G
--------+--------------+--------------
1 | 23491 | 22621
2 | 46436 | 44851
3 | 69265 | 66844
5 | 112432 | 108801
7 | 158859 | 150247
17 | 297600 | 291605
27 | 390041 | 384590
53 | 448384 | 447588
83 | 445582 | 442048
107 | 440544 | 438200
139 | 433893 | 430818
163 | 427436 | 424182
191 | 420854 | 417045
211 | 417228 | 413456

Perhaps something changes in the memory layout due to the array of
CVs, or the compiler lays out/optimizes functions differently. I can't
find the reason ;-( I would appreciate help on this.

regards

---

Yura Sokolov

Attachments:

2socket1.gif (image/gif)
2socket3.gif (image/gif)
��*E����`~�&R�f�"qT���7�a�-znC����6�h	4��a��}�r�`���w*ed@��B��<H|��}���xP����R�x�#�%��BQ��#��%��6�w��\����<{�gKD�H�p���<�m)����^��&r����e�B�`Bco7b�^	3`�i�'�;g6�a�!�)[���>A'�Xc��Z-w�<���O!�@�|�|c�xbI�H���9�D�V�GB!�B������	� ;���O�t�	�h�<o_���aN��R��4%"B��N��Grh%Z�<!��uM�2|��/�<�Wg:O���G�i�	D�����p���O��A�|�dC;$�K��=n����}^��n���#xL��=o���\��
���X7���n��Fse��*�&|_�:<�������d�g#0��%�����fI�D|����ny"�R@�;�ay�e^��%o�A��w{��	@���$��.I��(b$����?�l�0Z�F/���8X��$���-Xz!�@r���n����+�Nr����#<$/�6��]�a �S��)D!%t�;H\~�G�z 9N��?����@
�W	�#�������o�#H ������^	�A��!���n	����,�<��?�`H����������;�JP�f�����W��{r�E����^����@b�$(>��W1�;%�\��]�{�#���9�����b����B���+~@
O�X���	�����!�	��G0 �? 0G�;��t �#=�x�9�`?l	Hp�2(���	���!�1���P�
\A�:� 0���KA���C�6�X�' �d���	@��3�@�3�G�A$��0�"H	Q��(B��B�@AX��G0A�4��������yB	JX����.l����� �K�/������>��C�p�@��188,D�������*����Az���@�����M|�Pxj��"��@>�h&��Q\E��@�0���Ht���Y�0�+	��Xd>_<���8E/���3���x?�@��K>�h�Y�����KHh<08A3,/�@9����oL�C4���uJo�����Q34HH������0���|t?@������G�|
*��Y#AT�4|D�JD�(*��.L�/�H�7jPG)4	?��< 	CX�l��c{�,���1�E��g���[�p��7���4B@��#���W������t6E;�/����G�F�hG3L<@����WK�p�%H����4��J���������KI����8��4c���������QP���&�PX*�P��X�0L�D����'��J������)�/��Us8�/����W�L�D�K�|�������@�	����H�P������+�(���t������*��> KF�h-���N@C�6W�K�|2���f|�0������������O�j@����T	�2O���y��8P
!���O������D�/����8����I�K�4k��4	��<�J"��DSD�0(����!J�����	,X�K9���PP
������JXrPG�Cr�4����Np�����{z3�CH��������'X ��(��&U�Z��L�,�������P0����P���%�$�)BuS-SN��������Q��&��R���P;�A0M�@��|/=�J	,�-��6��Nu�H��4�7��QO5U�H2 $;��k��S��@��(X�J�U��A1��0a�@c	3����4���\�;-(� �U�����A�yP�	��T�x�;}�<�V�c�� ')�Fh�yqv��w�)��#uC`��!C}W�3rq�@�`P��A�?�:)���H2�X�b��������-� �.@AQ!�@S�X���^Q�p���$x�����%��5���� �(;�z��R�W}-��
�I�3s�J:�!��#Y. �R`�`��)�.(����-�i�		Yu'��Jr�RN~�Z�#YP	��0:!��E�Y:�>x���@��+��W���#���	 ��R ���&C�K=�\��k����d+��h�����]E4P2���XSP2L���M�[�H�{�J�4�]�HH��Hx"�$�%6�=^���K%�&��4�<���4D���]��I��U.�#�<B����$9�=�I����]������~�_�#�V���}�+X���<H����HB�0N�\,�Z;sRp�`��Dx],�QN(��`�s\���_-��'�6M��V�G����L�}6R(6��^~60���	�e�q�����]��r+���6����$�� p� P�`���W���b�bb�������-�~��|,�0�2�b��X��P��[L���X��_8��<p/�3�$�X�+��]�NNBh_@�+�J�Ah��;�6�aHF*Lp��x��[�(X��dP��>�Z�8cd��=�OHN0�S��Eh^���|�L�][�%8��d	.��Ia%J����c��=f#Rh�� �b�bj��&@�GTc��D�en~�D`�Y|^��;R�C�b�[�G��D��q�6��^d\�LcR���m�D�m���X�Z^A5�#M6IC(g���Lk����O�]SugxN����86'���=XRH���z6�N���>	,h������_��A g�v�ea�N	\f��>������D�g�nc���g����f����]gM�NXD���C ��������`M.�@�\��g�pgDe���^	���0^k��j�d�L���F�E����P��4�.��m:0������kM�B�v������+��^a����^�=U���L�k� ���D�ki~��C@�^�������7��QH
PhH&�����c`����������������fM�c��l*�.�������	R��vmC���~�����m��nL����A���6�bL����g���.�����l��B��<�����;���v	�$��em�������v���\v~���9
_p�KiL�]�X�K�mn�l��m����0��F	N��8�������6�
�N���&����W{n����������8���<��d	�$���l7�{��%��L����)��B�47	�~��Nn-���(�����6��m�.l!�R���6s*,l��J_�,��l!_�EpmD��"��{�piV�AoRCx])O	'�f��o��tH>�1	�v�R������^���uV�����?��\��]��^��_�`�a'�b7�cG�dW�eg�fw�g��h��i��j��k��l��m��n��o�p�q'�r7�sG��i�t?^���`����x���@����w�=��|Wz�Z�@��'��7��G��W��x59��[����������������������qD����A���-�8����� ��oy�H����R��`\������`�����y�/��'O�
:�������P��z��n��
@�R�z�xy��z��U��z����?{��
�/yh����?��(z�W{�����y�/	�ozY����F(��_	j�V8)�����M�z]��(��������`�������8W�@�k-�	�1���H��Y�������e}��V�V	@0����O��}��R����x|A�jM��/��8�������O��	�/�?�'��'����,X��w� ���~�/��7������R�~�/8���R,h� ��
2,��A��t0���C�
�d�����4"d$��J�lD�"I�&�:���J�0c��xq�@CB����E�+!���@��H/JZ��R���4������M�K�RKQ4�v'I�A�l*6j��k'm��'N�Z����/��ZL�TX'`P��
U@0����X����R�=��@Z.Hg���Q#<
H���� @�m��>(Z!������P{��F%�-���0���6����W�����B��������A��>yg��I~/U\�q�'�s	���k��6R��HJt�Pg]��v!�5A����@Q2R���a)X'�!b���)��@��XZR���hY)'.���`+���H8#A7
�Ci�{uA�~Q���%6���Di%�mX�~1��� ����9	��6��&DaN	'�}~6��(�W�]`���Y��"\�D�������Q��_f�@�j�T�)&JVjY�SRX����yVsm���z�%f�&+����1�kL�RJ�����l������^�'�q��Y��8�z�	�a���HI��l�UP�dE7�(��xQLg���%�A�����R@�$1�Z#l=@�JL^W����zi6������!D����H�2\�"K��F���7�St�B�.(� �n�6�K��)k���v��F.LRc�h��BP@J}]@�$d�4�cO
�S�t����=�`{�$�v�������3k��������g�5�8�7{;�8�S�YeDcF�E���u�y�g�=y�rk5���>;���~;����;����;��?<��<��+�<��;�i<��K?=��[=��k�=��{�=���?>���>����>����>���??���?����?����?(���< ��2��| #(�	R��� 3��
r���;
2socket10.gif (image/gif)
v12-bufmgr-lock-improvements.patch (text/x-patch; charset=UTF-8)
From bd214faa0cf0d9574c73d7851d1c0e6fd54bb429 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 27 Jun 2022 17:24:00 +0300
Subject: [PATCH] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. A placeholder entry
can be inserted first, both to reserve a place in the buffer table and
to inform other backends that we're going to actually allocate this
buffer.

Other backends can then wait on the ConditionVariable associated with
the backend doing the actual work.

The tag of the buffer a backend is working on is saved into a static
variable, so if the backend is interrupted for some reason, it will
remove the placeholder and wake up waiting backends.
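
A condensed sketch of the resulting fast path (simplified from the
BufferAlloc changes below; wait_for_backend and handle_hit are
stand-ins for the real sleeping/retry and pinning code):

    /* under an exclusive lock on the new tag's partition */
    buf_id_p = BufTableInsert(&newTag, newHash);
    if (*buf_id_p == -1)
        *buf_id_p = -2 - backId;    /* fresh entry: claim it as our placeholder */
    else if (*buf_id_p < -1)
        wait_for_backend(-2 - *buf_id_p);   /* someone else is inserting: sleep, retry */
    else
        handle_hit(*buf_id_p);      /* valid buffer id: ordinary cache hit */

On success the placeholder is overwritten with the real buffer id; on
error the saved tag lets cleanup code delete it and wake up waiters.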

Some ConditionVariable improvements are made for performance reasons:

- allow skipping ConditionVariableCancelSleep() by using the new
  ConditionVariableSleepOnce(cv).
  CVSleepOnce must not be used in a loop, since it doesn't re-enqueue
  the process into the CV. It therefore requires that CVPrepareToSleep(cv)
  has already been called (see the usage sketch after this list).

  It is safe to use CVSleepOnce in our case, since we're going to retry
  with a full-blown LWLockAcquire+BufTableLookup anyway, and most
  probably it will succeed. Using CVSleepOnce saves one interaction
  with the CV's spinlock, which is quite a big deal under contention.

- fetch and wake up processes with ConditionVariableBroadcastFast(cv).
  Iterating one by one causes a lot of synchronized atomic writes on
  the CV's spinlock, which harms performance significantly.
  Since waking processes in batches doesn't harm correctness, let's do
  it (a condensed sketch also follows the list).
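
The SleepOnce usage pattern, as it appears in the BufferAlloc changes
below (buf_id_p, buf_id and insertionCV as in the patch):

    ConditionVariablePrepareToSleep(insertionCV);   /* mandatory before SleepOnce */
    wait = *buf_id_p == buf_id;     /* recheck under the partition lock */
    LWLockRelease(newPartitionLock);
    if (wait)
        ConditionVariableSleepOnce(insertionCV, WAIT_EVENT_BUFFER_INSERT);
    else
        ConditionVariableCancelSleep();
    /* no sleep loop: the caller simply retries the buffer lookup */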
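
And the core of the batched wakeup, condensed from
ConditionVariableBroadcastFast below (sentinel handling elided):

    SpinLockAcquire(&cv->mutex);
    while (nprocs < BATCH_SIZE && !proclist_is_empty(&cv->wakeup))
        procs[nprocs++] = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
    SpinLockRelease(&cv->mutex);

    while (nprocs > 0)              /* set latches outside the spinlock */
        SetLatch(&procs[--nprocs]->procLatch);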

---
 src/backend/access/transam/xlogprefetcher.c   |  10 +
 src/backend/storage/buffer/buf_table.c        |  62 +++--
 src/backend/storage/buffer/bufmgr.c           | 214 ++++++++++--------
 src/backend/storage/buffer/freelist.c         |  13 +-
 src/backend/storage/lmgr/condition_variable.c | 122 +++++++++-
 src/backend/utils/activity/wait_event.c       |   3 +
 src/include/storage/buf_internals.h           |   3 +-
 src/include/storage/bufmgr.h                  |   2 +
 src/include/storage/condition_variable.h      |   2 +
 src/include/utils/wait_event.h                |   1 +
 10 files changed, 312 insertions(+), 120 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 959e4094667..84274e60f97 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -782,6 +782,16 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				block->prefetch_buffer = InvalidBuffer;
 				return LRQ_NEXT_IO;
 			}
+			else if (result.concurrent_io)
+			{
+				/*
+				 * Can this happen at all, i.e. could there be a concurrent
+				 * backend trying to load the same page? Anyway, try to do
+				 * something meaningful.
+				 */
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_NO_IO;
+			}
 			else
 			{
 				/*
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..81d27a3b8d1 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -23,15 +23,17 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "miscadmin.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
 {
 	BufferTag	key;			/* Tag of a disk page */
-	int			id;				/* Associated buffer ID */
+	volatile int id;			/* Associated buffer ID */
 } BufferLookupEnt;
 
 static HTAB *SharedBufHash;
+ConditionVariableMinimallyPadded *BufferInsertionCVArray;
 
 
 /*
@@ -41,7 +43,29 @@ static HTAB *SharedBufHash;
 Size
 BufTableShmemSize(int size)
 {
-	return hash_estimate_size(size, sizeof(BufferLookupEnt));
+	Size		sz;
+
+	/*
+	 * BufferAlloc inserts the new buffer entry before deleting the old one.
+	 * That is why an additional free entry is needed for every backend.
+	 *
+	 * Also, size cannot be less than NUM_BUFFER_PARTITIONS.
+	 *
+	 * And to compensate for the inefficiency of dynahash's get_hash_entry,
+	 * it is better to have more spare entries, so we use both MaxBackends
+	 * and NUM_BUFFER_PARTITIONS.
+	 */
+	size += MaxBackends + NUM_BUFFER_PARTITIONS;
+	sz = hash_estimate_size(size, sizeof(BufferLookupEnt));
+
+	/*
+	 * Every backend should have an associated ConditionVariable so we can map
+	 * them 1:1 without resolving collisions. An additional one is allocated
+	 * for the startup process (xlog replayer), which has MyBackendId == -1.
+	 */
+	sz = add_size(sz, mul_size(MaxBackends + 1,
+							   sizeof(ConditionVariableMinimallyPadded)));
+	return sz;
 }
 
 /*
@@ -52,6 +76,10 @@ void
 InitBufTable(int size)
 {
 	HASHCTL		info;
+	bool		found;
+
+	/* see comments in BufTableShmemSize */
+	size += MaxBackends + NUM_BUFFER_PARTITIONS;
 
 	/* assume no locking is needed yet */
 
@@ -63,7 +91,18 @@ InitBufTable(int size)
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
+								  HASH_FIXED_SIZE);
+	BufferInsertionCVArray = (ConditionVariableMinimallyPadded *)
+		ShmemInitStruct("Shared Buffer Backend Insertion CV",
+						(MaxBackends + 1) *
+						sizeof(ConditionVariableMinimallyPadded),
+						&found);
+	if (!found)
+	{
+		for (int i = 0; i < MaxBackends + 1; i++)
+			ConditionVariableInit(&BufferInsertionCVArray[i].cv);
+	}
 }
 
 /*
@@ -110,18 +149,17 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
  *		Insert a hashtable entry for given tag and buffer ID,
  *		unless an entry already exists for that tag
  *
- * Returns -1 on successful insertion.  If a conflicting entry exists
- * already, returns the buffer ID in that entry.
+ * Returns a pointer to the volatile int holding the buffer index.
+ * If the table entry was just inserted, the index is filled with -1.
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
-int
-BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
+volatile int *
+BufTableInsert(BufferTag *tagPtr, uint32 hashcode)
 {
 	BufferLookupEnt *result;
 	bool		found;
 
-	Assert(buf_id >= 0);		/* -1 is reserved for not-in-table */
 	Assert(tagPtr->blockNum != P_NEW);	/* invalid tag */
 
 	result = (BufferLookupEnt *)
@@ -131,12 +169,10 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
 									HASH_ENTER,
 									&found);
 
-	if (found)					/* found something already in the table */
-		return result->id;
-
-	result->id = buf_id;
+	if (!found)
+		result->id = -1;
 
-	return -1;
+	return &result->id;
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ae13011d275..e5d79737f2e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -166,6 +166,9 @@ static bool IsForInput;
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
+/* tag of speculatively inserted placeholder */
+static BufferTag PlaceholderTag = {{0, 0, 0}, 0, 0};
+
 /*
  * Backend-Private refcount management:
  *
@@ -506,7 +509,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 					 ForkNumber forkNum,
 					 BlockNumber blockNum)
 {
-	PrefetchBufferResult result = {InvalidBuffer, false};
+	PrefetchBufferResult result = {InvalidBuffer, false, false};
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
 	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
@@ -528,7 +531,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 	LWLockRelease(newPartitionLock);
 
 	/* If not in buffers, initiate prefetch */
-	if (buf_id < 0)
+	if (buf_id == -1)
 	{
 #ifdef USE_PREFETCH
 		/*
@@ -539,7 +542,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 			result.initiated_io = true;
 #endif							/* USE_PREFETCH */
 	}
-	else
+	else if (buf_id >= 0)
 	{
 		/*
 		 * Report the buffer it was in at that time.  The caller may be able
@@ -548,6 +551,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 */
 		result.recent_buffer = buf_id + 1;
 	}
+	else
+	{
+		/* Another backend is in the process of inserting it. */
+		result.concurrent_io = true;
+	}
 
 	/*
 	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
@@ -1114,13 +1122,17 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
 	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	ConditionVariable *insertionCV;
 	BufferTag	oldTag;			/* previous identity of selected buffer */
 	uint32		oldHash;		/* hash value for oldTag */
 	LWLock	   *oldPartitionLock;	/* buffer partition lock for it */
 	uint32		oldFlags;
 	int			buf_id;
+	int			backId;
+	volatile int *buf_id_p;
 	BufferDesc *buf;
 	bool		valid;
+	bool		wait;
 	uint32		buf_state;
 
 	/* create a tag so we can lookup the buffer */
@@ -1131,6 +1143,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	newPartitionLock = BufMappingPartitionLock(newHash);
 
 	/* see if the block is in the buffer pool already */
+retry:
 	LWLockAcquire(newPartitionLock, LW_SHARED);
 	buf_id = BufTableLookup(&newTag, newHash);
 	if (buf_id >= 0)
@@ -1173,9 +1186,58 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/*
 	 * Didn't find it in the buffer pool.  We'll have to initialize a new
-	 * buffer.  Remember to unlock the mapping lock while doing the work.
+	 * buffer.
+	 *
+	 * First, we insert a placeholder to indicate we're working on it (or to
+	 * find that someone else is). So re-lock the partition in exclusive mode.
 	 */
 	LWLockRelease(newPartitionLock);
+	LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+
+	buf_id_p = BufTableInsert(&newTag, newHash);
+	buf_id = *buf_id_p;
+
+	/* During startup MyBackendId == -1, so work around that. */
+	backId = MyBackendId > 0 ? MyBackendId - 1 : MaxBackends;
+
+	if (buf_id == -1)
+	{
+		/* OK, we are the first to try this buffer. Mark it with our backend id. */
+		*buf_id_p = -2 - backId;
+		/* And remember we're inserting it, so cleanup code can remove it */
+		PlaceholderTag = newTag;
+	}
+	else if (buf_id < -1)
+	{
+		/* Someone else is trying to insert this buffer. We should wait for it. */
+		insertionCV = &BufferInsertionCVArray[-2 - buf_id].cv;
+		ConditionVariablePrepareToSleep(insertionCV);
+
+		/*
+		 * buf_id_p is finally written later without holding the partition
+		 * lock, so the backend which inserted it could already call
+		 * CVBroadcast before we enqueue ourselves into the CV. Therefore,
+		 * we must recheck buf_id_p's content now.
+		 *
+		 * buf_id_p is still a valid pointer, since the entry can be
+		 * deleted/reused only under an exclusive partition lock.
+		 *
+		 * We rely on 32-bit atomicity here, as in several other places.
+		 */
+		wait = *buf_id_p == buf_id;
+	}
+	LWLockRelease(newPartitionLock);
+
+	if (buf_id < -1)
+	{
+		if (wait)
+			ConditionVariableSleepOnce(insertionCV, WAIT_EVENT_BUFFER_INSERT);
+		else
+			ConditionVariableCancelSleep();
+	}
+
+	if (buf_id != -1)
+		goto retry;
 
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
@@ -1283,7 +1345,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * exclusive lock on the old mapping partition.
+		 *
+		 * Note: we don't need to hold the lock on the new partition, since
+		 * we're effectively locking the placeholder entry.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1296,93 +1361,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1390,20 +1378,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
@@ -1429,16 +1413,24 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	else
 		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
 
+	/*
+	 * We need to write buf_id_p before unlocking the buffer header, since
+	 * BM_TAG_VALID means BufTable has an entry pointing to this buffer header.
+	 */
+	*buf_id_p = buf->buf_id;
 	UnlockBufHdr(buf, buf_state);
 
+	/* We're done. Forget about the placeholder. */
+	CLEAR_BUFFERTAG(PlaceholderTag);
+
 	if (oldPartitionLock != NULL)
 	{
 		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
+		LWLockRelease(oldPartitionLock);
 	}
 
-	LWLockRelease(newPartitionLock);
+	insertionCV = &BufferInsertionCVArray[backId].cv;
+	ConditionVariableBroadcastFast(insertionCV);
 
 	/*
 	 * Buffer contents are currently invalid.  Try to obtain the right to
@@ -2576,16 +2568,50 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	return result | BUF_WRITTEN;
 }
 
+static void
+BufferCleanupPlaceholder(void)
+{
+	uint32		hash;			/* hash value for newTag */
+	LWLock	   *partitionLock;	/* buffer partition lock for it */
+	int			backId;
+
+	if (likely(PlaceholderTag.rnode.relNode == InvalidOid))
+		return;
+
+	hash = BufTableHashCode(&PlaceholderTag);
+	partitionLock = BufMappingPartitionLock(hash);
+
+	/* During startup MyBackendId == -1, so work around that. */
+	backId = MyBackendId > 0 ? MyBackendId - 1 : MaxBackends;
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	/*
+	 * Just a last sanity check; it should always be true. I don't know how
+	 * to trigger this in tests, so no Assert here.
+	 */
+	if (BufTableLookup(&PlaceholderTag, hash) == -2 - backId)
+		BufTableDelete(&PlaceholderTag, hash);
+	LWLockRelease(partitionLock);
+
+	ConditionVariableBroadcastFast(&BufferInsertionCVArray[backId].cv);
+
+	CLEAR_BUFFERTAG(PlaceholderTag);
+}
+
 /*
  *		AtEOXact_Buffers - clean up at end of transaction.
  *
  *		As of PostgreSQL 8.0, buffer pins should get released by the
  *		ResourceOwner mechanism.  This routine is just a debugging
  *		cross-check that no pins remain.
+ *
+ *		Plus a sanity check that the process didn't die during buffer
+ *		insertion; if it did, the placeholder must be deleted.
  */
 void
 AtEOXact_Buffers(bool isCommit)
 {
+	BufferCleanupPlaceholder();
 	CheckForBufferLeaks();
 
 	AtEOXact_LocalBuffers(isCommit);
@@ -2628,6 +2654,7 @@ InitBufferPoolAccess(void)
 static void
 AtProcExit_Buffers(int code, Datum arg)
 {
+	BufferCleanupPlaceholder();
 	AbortBufferIO();
 	UnlockBuffers();
 
@@ -3367,6 +3394,7 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
 		buf_id = BufTableLookup(&bufTag, bufHash);
 		LWLockRelease(bufPartitionLock);
 
+		Assert(buf_id >= -1);
 		if (buf_id < 0)
 			continue;
 
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aaec..aa440e077f3 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -454,8 +454,8 @@ StrategyShmemSize(void)
 {
 	Size		size = 0;
 
-	/* size of lookup hash table ... see comment in StrategyInitialize */
-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	/* size of lookup hash table */
+	size = add_size(size, BufTableShmemSize(NBuffers));
 
 	/* size of the shared replacement strategy control block */
 	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -477,15 +477,8 @@ StrategyInitialize(bool init)
 
 	/*
 	 * Initialize the shared buffer lookup hashtable.
-	 *
-	 * Since we can't tolerate running out of lookup table entries, we must be
-	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(NBuffers);
 
 	/*
 	 * Get or create the shared strategy control block
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index de65dac3ae0..60f63762a31 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -30,6 +30,9 @@
 /* Initially, we are not prepared to sleep on any condition variable. */
 static ConditionVariable *cv_sleep_target = NULL;
 
+static bool ConditionVariableTimedSleepImpl(ConditionVariable *cv, long timeout,
+											uint32 wait_event_info, bool once);
+
 /*
  * Initialize a condition variable.
  */
@@ -97,8 +100,15 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 void
 ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
 {
-	(void) ConditionVariableTimedSleep(cv, -1 /* no timeout */ ,
-									   wait_event_info);
+	(void) ConditionVariableTimedSleepImpl(cv, -1 /* no timeout */ ,
+										   wait_event_info, false);
+}
+
+void
+ConditionVariableSleepOnce(ConditionVariable *cv, uint32 wait_event_info)
+{
+	(void) ConditionVariableTimedSleepImpl(cv, -1 /* no timeout */ ,
+										   wait_event_info, true);
 }
 
 /*
@@ -111,6 +121,20 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
 bool
 ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 							uint32 wait_event_info)
+{
+	return ConditionVariableTimedSleepImpl(cv, timeout, wait_event_info, false);
+}
+
+/*
+ * Wait for a condition variable to be signaled or a timeout to be reached.
+ *
+ * Returns true when timeout expires, otherwise returns false.
+ *
+ * See ConditionVariableSleep() for general usage.
+ */
+static bool
+ConditionVariableTimedSleepImpl(ConditionVariable *cv, long timeout,
+								uint32 wait_event_info, bool once)
 {
 	long		cur_timeout = -1;
 	instr_time	start_time;
@@ -134,6 +158,9 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 	 */
 	if (cv_sleep_target != cv)
 	{
+		if (once)
+			elog(FATAL, "ConditionVariableSleepOnce may only be called after "
+				 "ConditionVariablePrepareToSleep");
 		ConditionVariablePrepareToSleep(cv);
 		return false;
 	}
@@ -184,7 +211,8 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
 		{
 			done = true;
-			proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
+			if (!once)
+				proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
 		}
 		SpinLockRelease(&cv->mutex);
 
@@ -199,7 +227,11 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 
 		/* We were signaled, so return */
 		if (done)
+		{
+			if (once && cv == cv_sleep_target)
+				cv_sleep_target = NULL;
 			return false;
+		}
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (timeout >= 0)
@@ -362,3 +394,87 @@ ConditionVariableBroadcast(ConditionVariable *cv)
 			SetLatch(&proc->procLatch);
 	}
 }
+
+void
+ConditionVariableBroadcastFast(ConditionVariable *cv)
+{
+	int			pgprocno = MyProc->pgprocno;
+	bool		have_sentinel = false;
+#define BATCH_SIZE 16
+	PGPROC	   *proc = NULL;
+	PGPROC	   *procs[BATCH_SIZE] = {NULL};
+	int			nprocs = 0;
+
+	/*
+	 * In some use-cases, it is common for awakened processes to immediately
+	 * re-queue themselves.  If we just naively try to reduce the wakeup list
+	 * to empty, we'll get into a potentially-indefinite loop against such a
+	 * process.  The semantics we really want are just to be sure that we have
+	 * wakened all processes that were in the list at entry.  We can use our
+	 * own cvWaitLink as a sentinel to detect when we've finished.
+	 *
+	 * A seeming flaw in this approach is that someone else might signal the
+	 * CV and in doing so remove our sentinel entry.  But that's fine: since
+	 * CV waiters are always added and removed in order, that must mean that
+	 * every previous waiter has been wakened, so we're done.  We'll get an
+	 * extra "set" on our latch from the someone else's signal, which is
+	 * slightly inefficient but harmless.
+	 *
+	 * We can't insert our cvWaitLink as a sentinel if it's already in use in
+	 * some other proclist.  While that's not expected to be true for typical
+	 * uses of this function, we can deal with it by simply canceling any
+	 * prepared CV sleep.  The next call to ConditionVariableSleep will take
+	 * care of re-establishing the lost state.
+	 */
+	if (cv_sleep_target != NULL)
+		ConditionVariableCancelSleep();
+
+	do
+	{
+		/*
+		 * Fetch processes from proclist in batches.
+		 *
+		 * If there were less entries than batch size on first iteration, we
+		 * will done immediately. Otherwise we insert sentinel entry do detect
+		 * If there were fewer entries than the batch size on the first
+		 * iteration, we are done immediately. Otherwise we insert a sentinel
+		 * entry to detect the part of the list we are responsible for.
+		 *
+		 * Notice that if someone else removes our sentinel, we may wake up
+		 * to BATCH_SIZE additional processes before exiting.  That's
+		 * intentional, because if someone else signals the CV, they may be
+		 * intending to wake some third process that added itself to the list
+		 * a wakeup.
+		 */
+		SpinLockAcquire(&cv->mutex);
+		if (!have_sentinel)
+			/* While we're here, let's assert we're not in the list. */
+			Assert(!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink));
+
+		while (nprocs < BATCH_SIZE && !proclist_is_empty(&cv->wakeup))
+		{
+			proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
+			if (proc == MyProc)
+				break;
+			procs[nprocs++] = proc;
+		}
+
+		if (!have_sentinel && !proclist_is_empty(&cv->wakeup))
+		{
+			proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
+			have_sentinel = true;
+		}
+		else if (have_sentinel)
+		{
+			have_sentinel = proclist_contains(&cv->wakeup, pgprocno, cvWaitLink);
+		}
+		SpinLockRelease(&cv->mutex);
+
+		/* Awaken the batch of waiters, if there were any. */
+		while (nprocs > 0)
+		{
+			proc = procs[--nprocs];
+			SetLatch(&proc->procLatch);
+		}
+	} while (have_sentinel);
+}
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 87c15b9c6f3..e121a301409 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -331,6 +331,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_BTREE_PAGE:
 			event_name = "BtreePage";
 			break;
+		case WAIT_EVENT_BUFFER_INSERT:
+			event_name = "BufferInsert";
+			break;
 		case WAIT_EVENT_BUFFER_IO:
 			event_name = "BufferIO";
 			break;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index a17e7b28a53..5f2d73c6773 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -231,6 +231,7 @@ typedef union BufferDescPadded
 	((LWLock*) (&(bdesc)->content_lock))
 
 extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferIOCVArray;
+extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferInsertionCVArray;
 
 /*
  * The freeNext field is either the index of the next freelist entry,
@@ -327,7 +328,7 @@ extern Size BufTableShmemSize(int size);
 extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
-extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
+extern volatile int *BufTableInsert(BufferTag *tagPtr, uint32 hashcode);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 58391406f65..a11d04290f1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -53,6 +53,8 @@ typedef struct PrefetchBufferResult
 {
 	Buffer		recent_buffer;	/* If valid, a hit (recheck needed!) */
 	bool		initiated_io;	/* If true, a miss resulting in async I/O */
+	bool		concurrent_io;	/* If true, other backend is trying to load
+								 * page at the moment */
 } PrefetchBufferResult;
 
 /* forward declared, to avoid having to expose buf_internals.h here */
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index e89175ebd5c..76a7330c132 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -57,6 +57,7 @@ extern void ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info
 extern bool ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 										uint32 wait_event_info);
 extern void ConditionVariableCancelSleep(void);
+extern void ConditionVariableSleepOnce(ConditionVariable *cv, uint32 wait_event_info);
 
 /*
  * Optionally, ConditionVariablePrepareToSleep can be called before entering
@@ -69,5 +70,6 @@ extern void ConditionVariablePrepareToSleep(ConditionVariable *cv);
 /* Wake up a single waiter (via signal) or all waiters (via broadcast). */
 extern void ConditionVariableSignal(ConditionVariable *cv);
 extern void ConditionVariableBroadcast(ConditionVariable *cv);
+extern void ConditionVariableBroadcastFast(ConditionVariable *cv);
 
 #endif							/* CONDITION_VARIABLE_H */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index b578e2ec757..4120c45de41 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -86,6 +86,7 @@ typedef enum
 	WAIT_EVENT_BGWORKER_SHUTDOWN,
 	WAIT_EVENT_BGWORKER_STARTUP,
 	WAIT_EVENT_BTREE_PAGE,
+	WAIT_EVENT_BUFFER_INSERT,
 	WAIT_EVENT_BUFFER_IO,
 	WAIT_EVENT_CHECKPOINT_DONE,
 	WAIT_EVENT_CHECKPOINT_START,
-- 
2.36.1

1socket1.gif (image/gif)
1socket3.gif (image/gif)
1socket10.gifimage/gif; name=1socket10.gifDownload
GIF89aX��###+++333;;;CCCLLLRRR]]]eeekkksss|||������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������!�,X��W	H����*\�����#J�H����3j������ C�I����(S�\�����0c��I����8s�������@�
J����H�*]�����P�J�J����X�j������`��K����h��]�����p���K����x���������L�����+^������#K�L�����3k������y�����p�A�h�;$`��3�s{.���i��3�z�@B��}t��9��1��=������c,0����������<���#G�\|L��5�~p}{�F�`J������Lg'}X�
��j���a���@xQ��|p�D	�6@��������k��@ ����su���jE
jlh�`�@%����LR�=h��5��h��@O�6��Pb�����s�d�%�-�u�Z�W&���r�$ASv�J8yk��#A��&u�U���xz�@�V\�4 ��$�9@��0Z'��:Z�M������(l��$A
������
A�g�
�%(ga
�w���A��8���hJ���J�w*vY,j��2����k��(���bJ�R�V�@���f��*P0 �@�)Pf�����R�v�&�oAJ�����|��!,���*j.����]��B�5G�A�PFf��!u���@���k��m!M���%��~�:Zk��B97��@�d�s���\A���+�p�%5A[^�v(@���q�B��7�?�R��������
�.���-�qW���D	P=���<�h|J����3���/jD/d���4:Q�)�Z�{Ys��A
(<�p������x;�	:�M���@�nJ�Q���7����
�
����>Xs����y��@�o���F�m�;j;;^v ���N�R��!�M"f�Z[A���� o}�
����,@Z8���V����A���v��;���"@�@7��dn��N��f���`�����	��9����&��d�]�����Q��|�E�qX�#<�����DD��q>��d��$Hp4�"���O7_�fS�����L�"��F:�����$'I�JZ�����&7��Nz����(GI�R����L�*W��F����|,%�;�����%l�?h%n���6���tC*�"AB�h&4�)�@�r ���6���n>�$h��D!J0@@���G#�	d>6��������'*B����F�L\��`�
h�y�Y���!n�D'J��.�$��H?��m= }�HGJ��
��e�*����x�����r�'9��$23�!�f8 ��gP�R4��	�Y��H*�IU|�s"3�|&T�$�2R�Y��&u��hj �
��V����N#�Sd�(C�.����#�Xk&���q���k^�Z��+�6�)NM����AH6H����"j�i�P�&���2d�U��U!rY@9��Hg	���j�JEg�Y�����i�V��U��]cf���	
��`+���*A���@���5� e���"+���(@q�t������3��i�_�pc
�uS��$�S�!^hM�!`,��|����*���gB����F���	���T`.pT1�/������-T������A��^�=�F�Sm��-���6�j:������H1����N���0psb)����(�x��$Q�i����j:Y�����Z@g8s8O4|K�p@�p�:���@�I,H[� �@��jV�,��Su����PG�U���s�24�����6mc��	ys�0g���z��(0�`l��
�gD�f�)3��b9�!W�2BD�*R7z�E��A��V�k���f�
�
�2�������N~Q*C�*������K���P�0���di�dHD+�CT�i
'����p�����n,�@��p�2/b�9���Z
q�*���yS"h���7�����^��=�q��P Js���
m#�����C-�h�wI7Y����o"��������
`z>U���rW������/�A��\7�P�����W�
�p�aj�<���)�%gw�,�<�r�:�������<C\n���;�czpLA����Z{�*����=����^7�JwH�K*��<��@���
4>!e��f��{ts]����W_��H���������B�C���0	J������(W�FSk��n��5�����/H���z�}��`@a�`z�'{m�r9f,v9z�E	��}�1�m�7�G~��~�7e����GR:f	�@2#h@8{$X&�u��f\}0�h��+��W�+�R!��\@�G���q�q*������^Ru��'�Eim�Sct`y�Q��A
�y8s?XD~KC@F�2�����Nxtd(��d
�p����"����V�Z��3%���%�
��|����nu(vA����e�� 0	f`$��v>�h�pF��\@$��(t\�pr�]��r����}���>����6u�8-��s�a\5T��V��	!�����y@Hv�3O�p�������
@i�'��9��2��4�(��'��z1]
���hu�B��A���M:Y�8	���'Yu��^�?��R'�%����#0u�R�N�
j�#���Wq�W�2W�8�V���� bN��d9�&����8Q�v�e9o2�N0��@ ��)�N5�N/�:�h��"$=���W�C��FI!9BGp�f�}3Qm;��A9oP�y���
���w�)����9���u;��T��W�������y?�iw��W���i��
X��I��
��������V����k�pQ�t
�	������)��9��Y��y�����������������9��Y��y����������������9��Y��y���������������G�<p R\s�	ZZ� R@�������V@+*��!w�b��`~�sr�%%�!"��0�|�`�����Ub'���=>
��5pl�&%0�@�%b)P��5�WZ���oX��,jRa�� B/��W��e`��1���]����
�����na�H0��Q�R�����D�`��	���&%�1�#%ljM���Q� =�\B:s���	Q������������:��Z���������0�$P�;�<��=��=�t�@��<�$���@��W`�������:��Z��z���������M���
Pv�A(h�#�pp�!�����q�	��Q�W���� 
��	�@��R�ZRHP�M���&���
��	*��)���Ps`�9ay��#��-[�o!
�J�)��p��k��%�@��	!�+R�*�>!h��+*J���{A%�`��&��@���[Rk��<�Fy����� ��U�����R0�"u��� �	�+�Hp�j�����WKRK��+{e�k���X����
h��NA@���{;�BM�Z�1��Z�p��%����
��+����U�����+�{��t��k�;������#�U��{���s�����]K�|�R0� {��;R���ea����+R��n��	��#��K���������W����#��;�	q�g;]���h[�<a�v����@��{�{�D[�����e�9����#�c�y���,,����� 

\RL�������H&,C�/��y����9����'�!���H�K���$�z�$�c����+x�%��Y{�Q+����P��+��H	�hLR���0M;��i
���&���)��4����F�	����6���P���Wj��@�P\J���\�y<R��
��_��{�����0:���1��%%�K��
���4�-<���u|E���KRQ � ��&u��������P�tkR�����%C�eL�\J�K���-*�,�7��L�\�&����`��{���"L�1�>?��B"��\����,\��Z���*�{�������3��%u�|<�z��,Rm���A��e��Z�����������-]�{�
�U	��q�����������-��k�5�%�����s�o�J - D��H��>[�)h��!Q�uU	-����N�-:�q�j=B+q�j�*�!s�um�Jq����A��{�@��p"P��P	+ J����:[�L� ����-i`	�}I���
B�L� ���q��i���M�<l�
����,���q	B����J�
"���
p��j��� ���}M���)`�-Qj�	�]Qj������p`�p�����$���n�A����������p��JB0�J�,��o��	�H�-���V]]��I��
k�/q
>���+�nI�
n�9"`�+�I9+
p��:�
i �]��$�P	:����q�D��(A���������;q
+��R�H�{��M�[�E��`�F��h>����p���p-��l�~q�����wn9B{o���*��5��b�D^��n/�>��k����K�W1�u��W��K �Yq��.��L�\�
-p���������Hp_��sMQ�}����0�^QV�����b��o���-)p�Za\����?N�p��v��%�p��h
��
����c]Rdq	pe	=�q����#3�/0���
����ph0����p)��n-Q?~	����!�$�"�@f7j�3��J��?j��&N��`�	o�����4j@�,F�t���r�5@��@���uq	�~_�0A���o�!
@9zZ}��}j��:���?`R��d_��z  Qco�l��n��p�r?����8r�""�j)E�(?���)a	a?RK�J/QM��(���
�����#��$��#��+�A�k��|�	K��)�"�Q��A���c�{�G("��op0�G��a�������(��<!�M����
W���"��6��q
�NQN�bqN�����/��9�V'0�������������Q��A����-LO��J�@�
D�P�B�
>�Q�D�-^��Q�F����R��������� �4"XT"S�L�5m���S�DDBv����
j0E�T�R�M�v<��������G%U�]�~6�G�egbR#bL�m����aO�u1Vj!"�%�}����������������,5Y���B>Q�<�T��[b��6�h����i3^JX��W�l��G�=�	��U�\xC��
W�
��r���n�yr�qNE��=��6������$����{G�^lQ�@��g����Wa��W�
6���0@�P��<��m��J��7�k�
���8��C�� �/;DP�\(�J�P���B��:Eo�Q !���D�Q�J� O
9T���d����A(��#Be�9��r4L���:�$��B��(���H�a�=��H���Q�.��2���=��0P�(m0�N�c�d8P���#�PT1H�5�+��4�����-T��A=Wy�G�RQE?��BHa�A:�00�D:B������$A[�9���5�5Z���+��/�9t�#�TP���;����=���S.���;�J�E��O����Ki��+��
�V���5�hT�^#�A����T0EH�(91%��0�K� �58N�2��T�_��"�,M�C_����NU�D�=����Hs����JQ���T$r?5�(���X9���<o�)CI���TB�$q��9�uI%?�X`��cA�Yj�iJ%������;l��d��X��X���#��H:�;QTs����x����'8��*<m�:3��l�L���WIw����2�Q��T!
8��(�3�� ��JM��o�'������[m��R��|7,y{�D�5�m�
9��(�$�h�wi��['M#|�T����N���'JeQ<tx��(C�����(� �8E%�����)�
��QrV`>���T8���rF<��[#ZS� �L��|IE�`p���AI��������`
r��)�@�BP�w>@SN��}ubM2�D"�@�G��a�C�Ta���HQ�%`���A

L�|�9��dt>b��T�%`���t�?\5�R4�R�HL#`�	F6F����71�����P�m�V���>�)N$�R�>pd��qC���;BF�R�I*�%5? ��!W�����J�BC��\7�J��#�9AN�%`e��#�J��S�@�A������RJA�)�66��h���F:�+%��t�4�^~J�^P��@����!�r���DT�>�W6t)n��f�%��:����=�!�HW(Pa�J� ���B,�"�(�:!yx��E&$�z�H6*�����0�tj�-lA����C�0(��#k��������a�C��:U���B7:���h'$�P,�7'P�!T�	��SF���H�R�(��E�����p����i`���D���F%:d"}��T���4��1�
�����@���F�uO��)�����P���3'~���H!��z�c(k���T�� �XS����1g���o`������W�@Q���\��
�Z������5_
��<�@A���(h;�HB����OT�	?��"UdG��)���A�2n�2Y�nK����l���B
�$Q�|;<�(��3q�U����"_���\ha��>,���)��!���I^*P�7����<PB��]X����'��	����O��	��,����T@o�H�CT�^2�v���O���e���'��%�T�@��x���J6%�����AD+ZBT,,���dR��	6Q�f��@��g������� P{�t��@5#��!���H�91��������,��TE���2�Q\��\x�E��"KL9�Q5�������.#6���W�Q��E��Hp��Q��,��}29"��H*�#��o�����K$���9)�M� ^��\�B��*��	�rU}���\�
!sn������2�m������X�)�P��<k�#���n �� �G�N���� �4	��9�q��7�Aa���KH��9Qe�h��T�6N����B���z��
5�n`�9+t�+Y6���&����\"	���u�d�V�m��~���ZC��So�fN����Q��12.=(�`��N���A^i`u�>FG��
0�DS"�S�O�~elSo�sK�!�d'�����+su����'>$����2z.������e��w~#��V;�����[A���]/Z=s @�@@ �/��&
���w��W���[����"���T_��X��0�����
NC�!�
X9[�(�uQ�
�X���U����U8��I�@�@,�P���1�28�����R(��;T�S����P��0A;���	X48�U���	`����B)���(\	�B�p>��?�"0����	U8���A�2��i�%��@L�J �%\���
H�x��C=�
H����x��,���0���@�KC
���I�JH=��A�?�4(	 ?D�U�T��U�H�
��Y�EY!��Z�E]�.�'�?��a$�b4�cD�]��)��)H�g��h��i��j��k��l��m��l������YT
SA&tB3 �
(�*��+��)������`�h�*(3�3����� %�0�XS��
��U@�$JP��4 ��\���?B�GX�������)P�
5 ��{��( E���Xx��#�V�I���V���4r9_�����)LH�	�����
X?.K�K�"�P�1�4t�e2��+�K�1!�+�>(��pG�8�#I�p�(9��0IILK���-������q�8J��A4"
�L�c3�3��6��T8�T.X�L�(����L���[�����I�X����\Md��������������������!$�T����M��G�\U��DNj��4�
88A��'�0�,2���$�Ch8�P!�����.0��@�������M�D����8�����DP��&����������P�	NED"���
���yN��0�=��O�����U
8��.�P����U����L�G��H���;��"%��N8<���4�'�	7`��@��2)��A(�R�M�.P�?��	��3��S�*���,��[��DS4u'�RE�Ai!39�S(���G��Q�#��D]�SH�<]��:EUT6p@�A%T�Q�I�T4�S0u1��h!�r�� >�T�TS%�F��1�_�A��)/`�V��5�T<�U@)���U

�@�;X& �-3-����}1%�&�<}V�T7P�_}�F����k,����4;	�z����&q��3�su��
!��W��#$���KUzMCs�0����������X�!8�	b�F0�"�SU��=�4��5����P���8��-�5=80�������8�BO��:)�Y����`��C���YL�Y�e���G��;�r����@HZ!pQ�P��KC-�������O�G���R��M���Z�<UO=��2�?�;�X����-/���8+�����`�[
>�K�3���-+Q��Z�3/�R�%��E��{�/8�-���Q
M���Z�����
�-b5]�Q*pS�7�m�����P�������y��M��{R`�ZD�]Q�R�M���T�G�-/��M^�P�T��Eh"�;��-��-�����%_y��:���?��$���
2�T��^0T�X���_��^Z��Q(�F�����
����w}��%��j��(PV�u����
��0�M��`^
��U��H�]�?������sU���1]�F���`���(�8�UPR�
�S8��2`b��|���n��,�TH(P/�^ ��I0	�I-J��#�B-��)|c'\�R�~e\�->� ��_�TI�\1^�(
�IhX�T|ES\��V�C��bAN����U1����CC^4H�@G��StdXT:�cLV�G(P2(a��VZ&���@,$�'��	�B9��6�B���*��r��i��j��k�fb�f�- �
��p�q&gdl
R>��\��CTvt����x�H�����(�pZ^~<�_P�=�_z�,��S����J��I�&J��_<�0��� >�U�����-�)�u���51���r
,�PR�E�����U�Z�e����,X�,@�����d�~�,/X��%�����g}���i��<������F ��>O���F0��~����j���	�����x
��=�u���^
O,�S�����l��-�=�]#�f
����P|F���&�FP���YT�`^���e��h�Q�c���f���?]�*�������������U��������6�R�E���nI`�������f�6����F���W�F��e�E����4��Ej>a������^��j�6�0��&,����Xl�^_��n������\����<f��^re��h���������S�Nlg��	��kd��o�[Fx����o�0`.8T���n����Q���m�����}^o����k���&�0|HX���9���q��b�������&��l�.��>j/�?��fp�mI`�����a�o�0������Ss4wp_j7����e%��;��<��=��>��?�@�A'�B7�CG�DW�Eg�Fw�G��H��I��J��K��L��M��N��O�P�Q'�R7�SG�TW�Ug�Vw�W��X��Y��Z��[��\��&y�]��w������}�fv���W<��e�e?[xD�X,�i��j��k��l��m����H��Hg����r7�sG�tW�ug�vw�w������E�08�y�I{�p?���z���Hz��U@���	p?9����|7�{������x���hE����0��_�0��[��gx��x��a�y�x���OH}���W?�oh��?��yS��{��#����^��Py�8x�Lx�_��hz�7���J(�����_�{�(�\������
������u9�
�P	����x&vb<T��w30��
LH��@	��� |8�"~�$��u�{� ��/yD��������U`���U�|�w����W|�W�����@�7��G}�`}(�`O��W�_�g{����7����������z�`�
{���3 yz��{8�W	;���p��|�HL�������-�~�h�_���VD��h��G��_��fV,h��A������2Dh!�2#4��p���G���c�U�BxQ"I�&WE`P���� c�\if��U�N���)�'	8@�aA�G-&����M�S�r���u�I��	V,���h;`lZ�fL�tA	�5������K���U���P8,����5��%S	�����F�y������������]h[�oj���&��d�z�.&������W����@�������ld���l��;�����) �`��+p&�J���ePZ���*��+>�YAA��5�~`�I6P��~��%��*8W���Zb����A
����	�b����Y��G����a�[A�6������
�a���2$wfP�AHjWF�!pH�Cu0e�W�Pyut$�D%AV�2�FDyUQMZ&��*V���4��"YJ2	�Fz��o]�y�W����^�ie�B��hW�&P�RZ�����I�A�����yw��ePX�y@&� 0�e��]_�N����}U�le���AI�gF��Rlo�9@}@���v�����aen�j���"W��mR"�a�J�z�nA����B@lf��JV���E��)�NP���6a��f���o�����1w�q�<�Q@e����J���l�����3T�������@R1�l-A>��]yr05=5WO�3*'��WVG�J��et�&���Wd�l2Tj����Jo��u�}7�1��7�}��7��>8��~8��+�8��;�8��K>9��[~9��k�9��{�9���>:���~:����:����:���>;���~;����;����;��?<��<��+�<��;�<��K?=��[=��k�=��{�=���?>��S;
#68Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#67)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:

Tests:
- tests done on a 2-socket Xeon 5220 2.20GHz with turbo boost disabled
(i.e. max frequency is 2.20GHz)

Forgot to mention:
- this time it was CentOS 7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64

Perhaps the older kernel explains master's poor performance on 2 sockets
compared to my previous results (when this server ran Linux 5.10.103-1 on Debian).

Or there is a regression in PostgreSQL's master branch in between.
I'll try to check today.

regards

---

Yura Sokolov

#69Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#68)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, 28/06/2022 at 14:26 +0300, Yura Sokolov wrote:

On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:

Tests:
- tests done on a 2-socket Xeon 5220 2.20GHz with turbo boost disabled
(i.e. max frequency is 2.20GHz)

Forgot to mention:
- this time it was CentOS 7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64

Perhaps the older kernel explains master's poor performance on 2 sockets
compared to my previous results (when this server ran Linux 5.10.103-1 on Debian).

Or there is a regression in PostgreSQL's master branch in between.
I'll try to check today.

No, an old master commit (7e12256b47, Sat Mar 12 14:21:40 2022) behaves the same.
So it is clearly an old-kernel issue. Perhaps futexes were much slower in those
days.
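
For context on that guess: on Linux, an LWLock sleep ends up in a POSIX
semaphore, and glibc implements those on top of the futex(2) syscall, so
kernel-side futex changes between 3.10 and 5.10 would hit exactly this kind
of lock-heavy workload. Below is a minimal sketch of the underlying wait/wake
primitive, assuming Linux; the helper names are illustrative, not
PostgreSQL's or glibc's.

/*
 * Not PostgreSQL code -- a hypothetical, self-contained illustration of
 * the futex wait/wake pair that sits under sem_wait()/sem_post() on Linux.
 * The futex word must be a 32-bit integer.
 */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static long
sys_futex(atomic_int *word, int op, int val)
{
	/* futex(2) has no glibc wrapper, so it is invoked via syscall(2) */
	return syscall(SYS_futex, word, op, val, NULL, NULL, 0);
}

static void
futex_wait(atomic_int *word, int expected)
{
	/*
	 * The kernel puts us to sleep only while *word still equals
	 * expected; loop because FUTEX_WAIT may return spuriously or
	 * be interrupted by a signal.
	 */
	while (atomic_load(word) == expected)
		sys_futex(word, FUTEX_WAIT, expected);
}

static void
futex_wake_one(atomic_int *word)
{
	/* Wake at most one waiter blocked in futex_wait() on word */
	sys_futex(word, FUTEX_WAKE, 1);
}

On a kernel where this sleep/wake path is slow, every backend queueing on a
buffer mapping partition lock pays for it, which would match a degradation
that appears only on the old 3.10 kernel.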

#70Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Yura Sokolov (#69)
Re: BufferAlloc: don't take two simultaneous locks

On Tue, Jun 28, 2022 at 4:50 PM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

On Tue, 28/06/2022 at 14:26 +0300, Yura Sokolov wrote:

On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:

Tests:
- tests done on a 2-socket Xeon 5220 2.20GHz with turbo boost disabled
(i.e. max frequency is 2.20GHz)

Forgot to mention:
- this time it was CentOS 7.9.2009 (Core) with Linux mn10
3.10.0-1160.el7.x86_64

Perhaps the older kernel explains master's poor performance on 2 sockets
compared to my previous results (when this server ran Linux 5.10.103-1
on Debian).

Or there is a regression in PostgreSQL's master branch in between.
I'll try to check today.

No, an old master commit (7e12256b47, Sat Mar 12 14:21:40 2022) behaves the same.
So it is clearly an old-kernel issue. Perhaps futexes were much slower in those
days.

The patch requires a rebase; please do that.

Hunk #1 FAILED at 231.
Hunk #2 succeeded at 409 (offset 82 lines).

1 out of 2 hunks FAILED -- saving rejects to file
src/include/storage/buf_internals.h.rej

--
Ibrar Ahmed

#71Michael Paquier
michael@paquier.xyz
In reply to: Ibrar Ahmed (#70)
Re: BufferAlloc: don't take two simultaneous locks

On Wed, Sep 07, 2022 at 12:53:07PM +0500, Ibrar Ahmed wrote:

Hunk #1 FAILED at 231.
Hunk #2 succeeded at 409 (offset 82 lines).

1 out of 2 hunks FAILED -- saving rejects to file
src/include/storage/buf_internals.h.rej

With no rebase done since this notice, I have marked this entry as
Returned with Feedback (RwF).
--
Michael