Cache relation sizes?
Hello,
PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot. For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).
If we had a different kind of buffer mapping system of the kind that
Andres has described, there might be a place in shared memory that
could track the size of the relation. Even if we could do that, I
wonder if it would still be better to do a kind of per-backend
lock-free caching, like this:
1. Whenever a file size has been changed by extending or truncating
(ie immediately after the syscall), bump a shared "size_change"
invalidation counter.
2. Somewhere in SMgrRelation, store the last known size_change
counter and the last known size. In _mdnblocks(), if the counter
hasn't moved, we can use the cached size and skip the call to
FileSize().
3. To minimise false cache invalidations (caused by other relations'
size changes), instead of using a single size_change counter in shared
memory, use an array of N and map relation OIDs onto them.
4. As for memory coherency, I think it might be enough to use uint32
without locks or read barriers on the read size, since you have a view
of memory at least as new as your snapshot (the taking of which
included a memory barrier). That's good enough because we don't need
to see blocks added after our snapshot was taken (the same assumption
applies today, this just takes further advantage of it), and
truncations can't happen while we have a share lock on the relation
(the taking of which also includes memory barrier, covering the case
where the truncation happened after our snapshot and the acquisition
of the share lock on the relation). In other words, there is heavy
locking around truncation already, and for extension we don't care
about recent extensions so we can be quite relaxed about memory.
Right?
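To make that shape concrete, here is a minimal sketch of how steps 1-3 could
look using our atomics. All of the names here (RelSizeCounters, relsize_bump,
RELSIZE_COUNTER_SLOTS, relsize_cache_valid) are made up for illustration, and
the shared struct is assumed to be allocated in shared memory at startup; this
is not part of any patch.
/* Hypothetical sketch of steps 1-3, for illustration only. */
#define RELSIZE_COUNTER_SLOTS 64    /* N, chosen arbitrarily here */
typedef struct RelSizeCounters
{
    pg_atomic_uint32 slots[RELSIZE_COUNTER_SLOTS];
} RelSizeCounters;
static RelSizeCounters *relsize_counters;   /* lives in shared memory */
/* Map a relation onto one of the shared counters. */
static inline pg_atomic_uint32 *
relsize_counter_for(Oid relid)
{
    return &relsize_counters->slots[relid % RELSIZE_COUNTER_SLOTS];
}
/* Step 1: call this immediately after extending or truncating the file. */
static inline void
relsize_bump(Oid relid)
{
    pg_atomic_fetch_add_u32(relsize_counter_for(relid), 1);
}
/* Step 2: may a backend-local cached size still be trusted? */
static inline bool
relsize_cache_valid(Oid relid, BlockNumber cached_nblocks,
                    uint32 cached_counter)
{
    return cached_nblocks != InvalidBlockNumber &&
        cached_counter == pg_atomic_read_u32(relsize_counter_for(relid));
}
If relsize_cache_valid() returned false, the caller would do the lseek() as
today and remember the counter value it read along with the new size.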
I don't have a patch for this (though I did once try it as a
throw-away hack and it seemed to work), but I just wanted to share the
idea and see if anyone sees a problem with the logic/interlocking, or
has a better idea for how to do this. It occurred to me that I might
be missing something or this would have been done already...
--
Thomas Munro
http://www.enterprisedb.com
Hi,
On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot. For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).
I'd really really like to see some benchmarking before embarking on a
more complex scheme. I aesthetically dislike those lseeks, but ...
If we had a different kind of buffer mapping system of the kind that
Andres has described, there might be a place in shared memory that
could track the size of the relation. Even if we could do that, I
wonder if it would still be better to do a kind of per-backend
lock-free caching, like this:
Note that introducing that isn't primarily motivated
by getting rid of the size-determining lseeks, but by reducing the locking
for extending *and* truncating relations.
Greetings,
Andres Freund
On Wed, 7 Nov 2018 at 11:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Hello,
PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot. For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).
If we had a different kind of buffer mapping system of the kind that
Andres has described, there might be a place in shared memory that
could track the size of the relation. Even if we could do that, I
wonder if it would still be better to do a kind of per-backend
lock-free caching, like this:
On behalf of those looking after databases running over NFS (sigh!), I
think this is definitely worth exploring. Looking at the behaviour of
my (9.4.9) server, there's an lseek(SEEK_END) for every relation
(table or index) used by a query, which is a lot of them for a heavily
partitioned database. The lseek counts seem to be the same with
native partitions and 10.4.
As an incredibly rough benchmark, a "SELECT * FROM t ORDER BY pk LIMIT
0" on a table with 600 partitions, which builds a
MergeAppend/IndexScan plan, invokes lseek around 1200 times, and takes
600ms to return when repeated. (It's much slower the first time,
because the backend has to open the files, and read index blocks. I
found that increasing max_files_per_process above the number of
tables/indexes in the query made a huge difference, too!) Testing
separately, 1200 lseeks on that NFS mount takes around 400ms.
I'm aware of other improvements since 9.4.9 that would likely improve
things (the pread/pwrite change; possibly using native partitioning
instead of inheritance), but I imagine reducing lseeks would too.
(Of course, an even better improvement is to not put your data
directory on an NFS mount (sigh).)
On 7 November 2018 at 11:46, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot. For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).
I'd really really like to see some benchmarking before embarking on a
more complex scheme. I aesthetically dislike those lseeks, but ...
I agree. It would be good to see benchmarks on this first. Those
could be as simple as just some crude local backend cache that stuffs
the return value of RelationGetNumberOfBlocks in estimate_rel_size
into a hashtable and does not take into account the fact that it might
change. Should be okay to do some read-only benchmarking.
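For example, a purely hypothetical benchmarking hack along those lines (not a
patch proposal: it deliberately never invalidates anything, so it is only safe
for a read-only test, and the function and table names are invented) might be:
/* Crude benchmark hack: cache RelationGetNumberOfBlocks() per relation OID. */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/hsearch.h"
#include "utils/rel.h"
typedef struct RelSizeEntry
{
    Oid         relid;          /* hash key */
    BlockNumber nblocks;        /* cached size, never invalidated! */
} RelSizeEntry;
static HTAB *relsize_cache = NULL;
static BlockNumber
cached_relation_size(Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    RelSizeEntry *entry;
    bool        found;
    if (relsize_cache == NULL)
    {
        HASHCTL     ctl;
        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(RelSizeEntry);
        relsize_cache = hash_create("relsize cache", 128, &ctl,
                                    HASH_ELEM | HASH_BLOBS);
    }
    entry = (RelSizeEntry *) hash_search(relsize_cache, &relid,
                                         HASH_ENTER, &found);
    if (!found)
        entry->nblocks = RelationGetNumberOfBlocks(rel);
    return entry->nblocks;
}
estimate_rel_size() would then call cached_relation_size(rel) instead of
RelationGetNumberOfBlocks(rel) for the duration of the benchmark.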
The partitioning case is probably a less interesting case to improve
if we get [1], as we'll no longer ask for the size of any pruned
partitions. Queries that don't prune any partitions are less likely to
notice the extra overhead of the lseek(SEEK_END) since they've
probably got more work to do elsewhere.
[1]: https://commitfest.postgresql.org/20/1778/
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 9, 2018 at 4:42 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
On 7 November 2018 at 11:46, Andres Freund <andres@anarazel.de> wrote:
On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot. For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).
I'd really really like to see some benchmarking before embarking on a
more complex scheme. I aesthetically dislike those lseeks, but ...
I agree. It would be good to see benchmarks on this first. Those
could be as simple as just some crude local backend cache that stuff
the return value of RelationGetNumberOfBlocks in estimate_rel_size
into a hashtable and does not take into account the fact that it might
change. Should be okay to do some read-only benchmarking.
Oh, I just found the throw-away patch I wrote ages ago down the back
of the sofa. Here's a rebase. It somehow breaks initdb so you have
to initdb with unpatched. Unfortunately I couldn't seem to measure
any speed-up on a random EDB test lab Linux box using pgbench -S (not
"prepared"), but that test doesn't involve many tables, and also it's
an older kernel without KPTI mitigations. Attached in case anyone
else would like to try it.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Cache-file-sizes-to-avoid-lseek-calls.patch
From c98ece2fa6ca247a52c7f76d2d1999cc1683f34a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sat, 28 Jul 2018 01:25:12 +1200
Subject: [PATCH] Cache file sizes to avoid lseek() calls.
Use a shared invalidation counter to allow md.c to cache file
sizes for databases that reach a steady size.
XXX WIP -- highly experimental, there may be much better ideas
than this, and the memory synchronisation may not be strong enough!
Also smgrinit2 isn't a good name, obv.
---
src/backend/storage/smgr/md.c | 45 ++++++++++++++++++++++++++++++-
src/backend/storage/smgr/smgr.c | 15 +++++++++++
src/backend/utils/init/postinit.c | 2 ++
src/include/storage/smgr.h | 2 ++
4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 86013a5c8b..38f659afd0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -110,6 +110,8 @@ typedef struct _MdfdVec
{
File mdfd_vfd; /* fd number in fd.c's pool */
BlockNumber mdfd_segno; /* segment number, from 0 */
+ BlockNumber nblocks; /* cached version of number of blocks */
+ uint32 relsize_change_counter; /* used for invalidation */
} MdfdVec;
static MemoryContext MdCxt; /* context for all MdfdVec objects */
@@ -198,6 +200,13 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+typedef struct MdSharedData
+{
+ /* XXX could have an array of these, and use rel OID % nelements? */
+ pg_atomic_uint32 relsize_change_counter;
+} MdSharedData;
+
+static MdSharedData *MdShared;
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -244,6 +253,16 @@ mdinit(void)
}
}
+void
+mdinit2(void)
+{
+ bool found;
+
+ MdShared = ShmemInitStruct("MdShared", sizeof(MdSharedData), &found);
+ if (!found)
+ pg_atomic_init_u32(&MdShared->relsize_change_counter, 0);
+}
+
/*
* In archive recovery, we rely on checkpointer to do fsyncs, but we will have
* already created the pendingOpsTable during initialization of the startup
@@ -538,6 +557,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nbytes, BLCKSZ, blocknum),
errhint("Check free disk space.")));
}
+ pg_atomic_fetch_add_u32(&MdShared->relsize_change_counter, 1);
if (!skipFsync && !SmgrIsTemp(reln))
register_dirty_segment(reln, forknum, v);
@@ -600,6 +620,8 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
mdfd = &reln->md_seg_fds[forknum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
+ mdfd->nblocks = InvalidBlockNumber;
+ mdfd->relsize_change_counter = 0;
Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
@@ -986,6 +1008,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
}
curopensegs--;
}
+ pg_atomic_fetch_add_u32(&MdShared->relsize_change_counter, 1);
}
/*
@@ -1950,8 +1973,24 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
static BlockNumber
_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ uint32 relsize_change_counter = 0;
+ BlockNumber result;
off_t len;
+ if (MdShared)
+ {
+ /*
+ * When acquiring a snapshot, we executed a memory barrier. Likewise
+ * for a relation extension lock. Therefore we must be able to see
+ * a value of relsize_change_counter fresh enough for our purposes.
+ * XXX Right? Hmm.
+ */
+ relsize_change_counter = pg_atomic_read_u32(&MdShared->relsize_change_counter);
+ if (seg->nblocks != InvalidBlockNumber &&
+ seg->relsize_change_counter == relsize_change_counter)
+ return seg->nblocks;
+ }
+
len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
@@ -1959,5 +1998,9 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
errmsg("could not seek to end of file \"%s\": %m",
FilePathName(seg->mdfd_vfd))));
/* note that this calculation will ignore any partial block at EOF */
- return (BlockNumber) (len / BLCKSZ);
+ result = len / BLCKSZ;
+ seg->nblocks = result;
+ seg->relsize_change_counter = relsize_change_counter;
+
+ return result;
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342ef86..06cd70cf03 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -38,6 +38,7 @@
typedef struct f_smgr
{
void (*smgr_init) (void); /* may be NULL */
+ void (*smgr_init2) (void); /* may be NULL */
void (*smgr_shutdown) (void); /* may be NULL */
void (*smgr_close) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_create) (SMgrRelation reln, ForkNumber forknum,
@@ -69,6 +70,7 @@ static const f_smgr smgrsw[] = {
/* magnetic disk */
{
.smgr_init = mdinit,
+ .smgr_init2 = mdinit2,
.smgr_shutdown = NULL,
.smgr_close = mdclose,
.smgr_create = mdcreate,
@@ -128,6 +130,19 @@ smgrinit(void)
on_proc_exit(smgrshutdown, 0);
}
+/* Run after shared memory is initialized. */
+void
+smgrinit2(void)
+{
+ int i;
+
+ for (i = 0; i < NSmgr; i++)
+ {
+ if (smgrsw[i].smgr_init2)
+ smgrsw[i].smgr_init2();
+ }
+}
+
/*
* on_proc_exit hook for smgr cleanup during backend shutdown
*/
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0d28..3813c64ddb 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -594,6 +594,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
elog(DEBUG3, "InitPostgres");
+ smgrinit2();
+
/*
* Add my PGPROC struct to the ProcArray.
*
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc969..45e05955b5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -81,6 +81,7 @@ typedef SMgrRelationData *SMgrRelation;
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
extern void smgrinit(void);
+extern void smgrinit2(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln);
@@ -116,6 +117,7 @@ extern void AtEOXact_SMgr(void);
/* in md.c */
extern void mdinit(void);
+extern void mdinit2(void);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
--
2.19.1
On Fri, 16 Nov 2018 at 12:06, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Oh, I just found the throw-away patch I wrote ages ago down the back
of the sofa. Here's a rebase. It somehow breaks initdb so you have
to initdb with unpatched. Unfortunately I couldn't seem to measure
any speed-up on a random EDB test lab Linux box using pgbench -S (not
"prepared"), but that test doesn't involve many tables, and also it's
an older kernel without KPTI mitigations. Attached in case anyone
else would like to try it.
Over on [1] there's some talk about how, when using PREPAREd statements
on a table with many partitions where some of the parameters help
prune many of the partitions, there's a huge spike in query latency
on the 6th, and only the 6th, execution of the statement. This will
be down to the fact that GetCachedPlan() builds a
generic plan on the 6th execution and most likely discards it due to
it appearing too expensive because of lack of any partition pruning.
The custom plan's cost is likely much much cheaper, so the generic
plan is planned but never used. This may be a good real-life
candidate to test this patch with. I know from benchmarks I performed
several months ago that the lseek() call to determine the relation
size was a large part of the cost of planning with many partitions.
[1]: /messages/by-id/25C1C6B2E7BE044889E4FE8643A58BA963D89214@G01JPEXMBKW03
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello,
I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate large tables.
As mentioned by David, currently there is a query latency spike when we make generic plan for partitioned table with many partitions.
I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time,
I also monitored the lseek calls using simple strace, with and without the patch.
Below are the test results.
Setup 8192 table partitions.
(1) set plan_cache_mode = 'force_generic_plan';
[Without Patch]
prepare select_stmt(int) as select * from t where id = $1;
explain (timing off, analyze) execute select_stmt(8192);
[…]
Planning Time: 1678.680 ms
Execution Time: 643.495 ms
$ strace -p [pid] -e trace=lseek -c
% time seconds usecs/call calls errors syscall
---------------------------------------------------------------------------
100.00 0.017247 1 16385 lseek
[With Patch]
[…]
Planning Time: 1596.566 ms
Execution Time: 653.646 ms
$ strace -p [pid] -e trace=lseek -c
% time seconds usecs/call calls errors syscall
---------------------------------------------------------------------------
100.00 0.009196 1 8192 lseek
It was mentioned in the other thread [1] that creating a generic plan for the first time is very expensive.
Although this patch did not seem to reduce the planning time for force_generic_plan,
it seems that the number of lseek calls was reduced by half during the first execution of the generic plan.
(2) plan_cache_mode = 'auto'
reset plan_cache_mode; -- resets to auto / custom plan
[Without Patch]
[…]
Planning Time: 768.669 ms
Execution Time: 0.776 ms
$ strace -p [pid] -e trace=lseek -c
% time seconds usecs/call calls errors syscall
---------------------------------------------------------------------------
100.00 0.015117 2 8193 lseek
[With Patch]
[…]
Planning Time: 181.690 ms
Execution Time: 0.615 ms
$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.
Without the patch, there were around 8193 lseek calls.
With the patch applied, there were no lseek calls when creating the custom plan.
(3) set plan_cache_mode = 'force_generic_plan';
-- force it to generic plan again to use the cached plan (no re-planning)
[Without Patch]
[…]
Planning Time: 14.294 ms
Execution Time: 601.141 ms
$ strace -p [pid] -e trace=lseek -c
% time seconds usecs/call calls errors syscall
---------------------------------------------------------------------------
0.00 0.000000 0 1 lseek
[With Patch]
[…]
Planning Time: 13.976 ms
Execution Time: 570.518 ms
$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.
----
If I did the test correctly, I am not sure why the patch did not affect the generic planning performance of a table with many partitions.
However, the number of lseek calls was greatly reduced with Thomas’ patch.
I also did not get a considerable speed-up in terms of average latency using pgbench -S (read-only, unprepared).
I am assuming this might be applicable to other use cases as well.
(I just tested the patch, but haven’t dug up the patch details yet).
Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement?
[1]: /messages/by-id/CAEepm=3SSw-Ty1DFcK=1rU-K6GSzYzfdD4d+ZwapdN7dTa6=nQ@mail.gmail.com
Regards,
Kirk Jamison
On Thu, Dec 27, 2018 at 8:00 PM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate large tables.
As mentioned by David, currently there is a query latency spike when we make generic plan for partitioned table with many partitions.
I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time,
I also monitored the lseek calls using simple strace, with and without the patch.
Thanks for looking into this and testing!
Setup 8192 table partitions.
(1) set plan_cache_mode = 'force_generic_plan';
Planning Time: 1678.680 ms
Planning Time: 1596.566 ms
(2) plan_cache_mode = 'auto'
Planning Time: 768.669 ms
Planning Time: 181.690 ms
(3) set plan_cache_mode = 'force_generic_plan';
Planning Time: 14.294 ms
Planning Time: 13.976 ms
If I did the test correctly, I am not sure why the patch did not affect the generic planning performance of a table with many partitions.
However, the number of lseek calls was greatly reduced with Thomas’ patch.
I also did not get a considerable speed-up in terms of average latency using pgbench -S (read-only, unprepared).
I am assuming this might be applicable to other use cases as well.
(I just tested the patch, but haven’t dug up the patch details yet).
The result for (2) is nice. Even though you had to use 8192
partitions to see it.
Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement?
For now I think this still in the experiment/hack phase and I have a
ton of other stuff percolating in this commitfest already (and a week
of family holiday in the middle of January). But if you have ideas
about the validity of the assumptions, the reason it breaks initdb, or
any other aspect of this approach (or alternatives), please don't let
me stop you, and of course please feel free to submit this, an
improved version or an alternative proposal yourself! Unfortunately I
wouldn't have time to nurture it this time around, beyond some
drive-by comments.
Assorted armchair speculation: I wonder how much this is affected by
the OS and KPTI, virtualisation technology, PCID support, etc. Back
in the good old days, Linux's lseek(SEEK_END) stopped acquiring the
inode mutex when reading the size, at least in the generic
implementation used by most filesystems (I wonder if our workloads
were indirectly responsible for that optimisation?) so maybe it became
about as fast as a syscall could possibly be, but now the baseline for
how fast syscalls can be has moved and it also depends on your
hardware, and it also has external costs that depend on what memory
you touch in between syscalls. Also, other operating systems might
still acquire a per-underlying-file/vnode/whatever lock (<checks
source code>... yes) and the contention for that might depend on what
else is happening, so that a single standalone test wouldn't capture
that but a super busy DB with a rapidly expanding and contracting
table that many other sessions are trying to observe with
lseek(SEEK_END) could slow down more.
--
Thomas Munro
http://www.enterprisedb.com
Hi Thomas,
On Friday, December 28, 2018 6:43 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
[...]if you have ideas about the validity of the assumptions, the reason it breaks initdb, or any other aspect of this approach (or alternatives), please don't let me stop you, and of course please feel free to submit this, an improved version or an alternative proposal [...]
Sure. Thanks. I'd like to try to work on the idea. I also took a look at the code, and I hope you don't mind if I ask for clarifications (explanation/advice/opinions) on the following, since my postgres experience is not substantial enough yet.
(1) I noticed that you store a "relsize_change_counter" in MdSharedData. Is my understanding below correct?
The intention is to cache the rel_size per backend (lock-free), with an array of relsize_change_counters so that the lseek syscall can be skipped when the counter does not change.
In _mdnblocks(), if the counter has not changed, the cached rel size (nblocks) is used and the call to FileSize() (an lseek to get and cache the rel size) is skipped. That means that in default Postgres master, the lseek syscall (through FileSize()) is called whenever we want to get the rel size (nblocks).
On the other hand, the simplest method I thought of that could also work is to only cache the file size (nblock) in shared memory, not in the backend process, since both nblock and relsize_change_counter are uint32 data types anyway. If relsize_change_counter can be changed without a lock, then nblock can be changed without a lock too, is that right? In that case, nblock can be accessed directly in shared memory. In this case, is it necessary to cache the relation size in the backend?
(2) Is the MdSharedData temporary or permanent in shared memory?
from the patch:
typedef struct MdSharedData
{
/* XXX could have an array of these, and use rel OID % nelements? */
pg_atomic_uint32 relsize_change_counter;
} MdSharedData;
static MdSharedData *MdShared;
What I intend to have is a permanent hashtable that keeps the file size (and eventually, in future development, table addresses) in shared memory for faster access by backend processes. The idea is to keep track of the unallocated blocks, based on how much the relation has been extended or truncated. Memory for this hashtable would be dynamically allocated.
Thanks,
Kirk Jamison
From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com]
On the other hand, the simplest method I thought that could also work is to only cache
the file size (nblock) in shared memory, not in the backend process, since both nblock
and relsize_change_counter are uint32 data type anyway. If relsize_change_counter
can be changed without lock, then nblock can be changed without lock, is it right? In
that case, nblock can be accessed directly in shared memory. In this case, is the
relation size necessary to be cached in backend?
(Aside from which idea is better.. )
If you want to put the relation size in shared memory, then I don't think caches in the backends
are necessary, because every time the relation size is updated you need to invalidate the caches
in the backends. Taking a shared lock on the cache at reference time and an exclusive lock at
update time is simple without a backend cache.
(2) Is the MdSharedData temporary or permanent in shared memory?
That data structure seems to be initialized at the time of InitPostgres, which means it's permanent,
because shared memory initialized by postgres doesn't have a chance to be dropped, as far as I know.
(If you want a temporary data structure, then another mechanism like dsm/dsa/dshash is a candidate.)
Regards,
Takeshi Ideriha
From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com]
On the other hand, the simplest method I thought that could also work is
to only cache the file size (nblock) in shared memory, not in the backend
process, since both nblock and relsize_change_counter are uint32 data type
anyway. If relsize_change_counter can be changed without lock, then nblock
can be changed without lock, is it right? In that case, nblock can be accessed
directly in shared memory. In this case, is the relation size necessary
to be cached in backend?
Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.
(2) Is the MdSharedData temporary or permanent in shared memory?
from the patch:
typedef struct MdSharedData
{
/* XXX could have an array of these, and use rel OID % nelements?
*/
pg_atomic_uint32 relsize_change_counter;
} MdSharedData;
static MdSharedData *MdShared;
Permanent in shared memory.
Regards
Takayuki Tsunakawa
At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB955DF@G01JPEXMBYT05>
From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com]
On the other hand, the simplest method I thought that could also work is
to only cache the file size (nblock) in shared memory, not in the backend
process, since both nblock and relsize_change_counter are uint32 data type
anyway. If relsize_change_counter can be changed without lock, then nblock
can be changed without lock, is it right? In that case, nblock can be accessed
directly in shared memory. In this case, is the relation size necessary
to be cached in backend?
Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.
Just one counter in the patch *seems* to give a significant gain
compared to its complexity, given the cost or latency that lseek
brings, especially on workloads where files scarcely change.
Though I didn't run it on a test bench.
(2) Is the MdSharedData temporary or permanent in shared memory?
from the patch:
typedef struct MdSharedData
{
/* XXX could have an array of these, and use rel OID % nelements?
*/
pg_atomic_uint32 relsize_change_counter;
} MdSharedData;
static MdSharedData *MdShared;
Permanent in shared memory.
I'm not sure what the duration of 'permanent' is there, but it
disappears when the server stops. Anyway, it doesn't need to be
permanent beyond a server restart.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Just one counter in the patch *seems* to give significant gain
comparing to the complexity, given that lseek is so complex or it
brings latency, especially on workloads where file is scarcely
changed. Though I didn't run it on a test bench.
I expect so, too.
I'm not sure the duration of the 'permanent' there, but it
disappears when server stops. Anyway it doesn't need to be
permanent beyond a server restart.
Right, it exists while the server is running.
Regards
Takayuki Tsunakawa
On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote:
At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote:
Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.
Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench.
(2) Is the MdSharedData temporary or permanent in shared memory?
Permanent in shared memory.
I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart.
Thank you for the insights.
I did a simple test in the previous email using simple syscall tracing;
the patch significantly reduced the number of lseek syscalls
(but that simple test might not be enough to describe the performance benefit).
Regarding Tsunakawa-san's comment,
in Thomas' patch, he made a place in shared memory that stores the
relsize_change_counter, so I am thinking of utilizing the same,
but for caching the relsize itself.
Perhaps I should explain further the intention behind the design.
The first step is to cache the file size in shared memory. Similar to the
intention or purpose of the patch written by Thomas, this is to reduce the
number of lseek(SEEK_END) calls by caching the relation size without using
additional locks. The difference is that the relation size is cached in
shared memory itself. I wonder if there are problems that you can see with
this approach.
Eventually, the next step is to have a structure in shared memory
that caches file addresses along with their sizes (Andres' idea of
putting an open relations table in the shared memory). With a
structure that groups buffers into file relation units, we can get
the file size directly from shared memory, so the assumption is it would
be faster whenever we truncate/extend our relations, because we can
track the offset of the changes in size and use ranges for managing
the truncation, etc.
The next step is a complex direction that needs serious discussion,
but I am wondering if we can proceed with the first step for now if
the idea and direction are valid.
Regards,
Kirk Jamison
On 2019-02-06 08:50:45 +0000, Jamison, Kirk wrote:
On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote:
At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote:
Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.
Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench.
(2) Is the MdSharedData temporary or permanent in shared memory?
Permanent in shared memory.
I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart.
Thank you for the insights.
I did a simple test in the previous email using simple syscall tracing,
the patch significantly reduced the number of lseek syscall.
(but that simple test might not be enough to describe the performance benefit)
Regarding Tsunakawa's comment,
in Thomas' patch, he made a place in shared memory that stores the
relsize_change_counter, so I am thinking of utilizing the same,
but for caching the relsize itself.
Perhaps I should explain further the intention for the design.
First step, to cache the file size in the shared memory. Similar to the
intention or purpose of the patch written by Thomas, to reduce the
number of lseek(SEEK_END) by caching the relation size without using
additional locks. The difference is by caching a rel size on the shared
memory itself. I wonder if there are problems that you can see with
this approach.
Eventually, the next step is to have a structure in shared memory
that caches file addresses along with their sizes (Andres' idea of
putting an open relations table in the shared memory). With a
structure that group buffers into file relation units, we can get
file size directly from shared memory, so the assumption is it would
be faster whenever we truncate/extend our relations because we can
track the offset of the changes in size and use range for managing
the truncation, etc..
The next step is a complex direction that needs serious discussion,
but I am wondering if we can proceed with the first step for now if
the idea and direction are valid.
Maybe I'm missing something here, but why is it actually necessary to
have the sizes in shared memory, if we're just talking about caching
sizes? It's pretty darn cheap to determine the filesize of a file that
has been recently stat()/lseek()/ed, and introducing per-file shared
data adds *substantial* complexity, because the amount of shared memory
needs to be pre-determined. The reason I want to put per-relation data
into shared memory is different: it's about putting the buffer mapping
into shared memory, and that, as a prerequisite, also needs per-relation
data. And there's a limit on the number of relations that need to be
open (one per cached page at max), and space can be freed by evicting
pages.
Greetings,
Andres Freund
On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
Maybe I'm missing something here, but why is it actually necessary to
have the sizes in shared memory, if we're just talking about caching
sizes? It's pretty darn cheap to determine the filesize of a file that
has been recently stat()/lseek()/ed, and introducing per-file shared
data adds *substantial* complexity, because the amount of shared memory
needs to be pre-determined. The reason I want to put per-relation data
into shared memory is different, it's about putting the buffer mapping
into shared memory, and that, as a prerequisite, also need per-relation
data. And there's a limit of the number of relations that need to be
open (one per cached page at max), and space can be freed by evicting
pages.
Ahh.. You are right about the logic of putting it in the shared memory.
As for Thomas' toy patch, multiple files share one counter in shmem.
Although it currently works, it is likely to produce unnecessary cache misses.
Though his eventual plan for the idea is to use an array of N counters
and map relation OIDs onto them.
But as your point about complexity says, in shared memory we cannot
share the same area among multiple files, so that needs an area to
allocate depending on the number of files.
Regarding the allocation of per-relation data in shared memory, I
thought at first that it could be a separate component, so I asked about
the validity of the idea. But now I see the point raised.
Regards,
Kirk Jamison
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <k.jamison@jp.fujitsu.com> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24>
On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
Maybe I'm missing something here, but why is it actually necessary to
have the sizes in shared memory, if we're just talking about caching
sizes? It's pretty darn cheap to determine the filesize of a file that
has been recently stat()/lseek()/ed, and introducing per-file shared
data adds *substantial* complexity, because the amount of shared memory
needs to be pre-determined. The reason I want to put per-relation data
into shared memory is different, it's about putting the buffer mapping
into shared memory, and that, as a prerequisite, also need per-relation
data. And there's a limit of the number of relations that need to be
open (one per cached page at max), and space can be freed by evicting
pages.
Ahh.. You are right about the logic of putting it in the shared memory.
As for Thomas' toy patch, multiple files share one counter in shmem.
Although it currently works, it might likely to miss.
Though his eventual plan of the idea is to use an array of N counters
and map relation OIDs onto them.
But as your point about complexity says, in shared memory we cannot
share the same area with multiple files, so that needs an area to
allocate depending on the number of files.
Regarding the allocation of per-relation data in shared memory, I
thought it can be a separated component at first so I asked for
validity of the idea. But now I consider the point raised.
I still believe that one shared memory element for every
non-mapped relation is not only too complex but also too much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all works fine, but partitioned flags (that is, relations or files
that hash to the same value share one flag) can reduce the
shared memory elements to a fixed small number.
Note: I'm still not sure how much lseek impacts performance.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, 14 Feb 2019 at 20:41, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
wrote:
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <
k.jamison@jp.fujitsu.com> wrote in
<D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24>
On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
Maybe I'm missing something here, but why is it actually necessary to
have the sizes in shared memory, if we're just talking about caching
sizes? It's pretty darn cheap to determine the filesize of a file that
has been recently stat()/lseek()/ed, and introducing per-file shared
data adds *substantial* complexity, because the amount of shared memory
needs to be pre-determined. The reason I want to put per-relation data
into shared memory is different, it's about putting the buffer mapping
into shared memory, and that, as a prerequisite, also need per-relation
data. And there's a limit of the number of relations that need to be
open (one per cached page at max), and space can be freed by evicting
pages.
Ahh.. You are right about the logic of putting it in the shared memory.
As for Thomas' toy patch, multiple files share one counter in shmem.
Although it currently works, it might likely to miss.
Though his eventual plan of the idea is to use an array of N counters
and map relation OIDs onto them.
But as your point about complexity says, in shared memory we cannot
share the same area with multiple files, so that needs an area to
allocate depending on the number of files.
Regarding the allocation of per-relation data in shared memory, I
thought it can be a separated component at first so I asked for
validity of the idea. But now I consider the point raised.
I still believe that one shared memory element for every
non-mapped relation is not only too-complex but also too-much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all works fine but partitioned flags (that is, relations or files
corresponds to the same hash value share one flag) can reduce the
shared memory elements to a fixed small number.
Note: I'm still not sure how much lseek impacts performance.
Of course all the "flag"s above actually are "update counter"s :)
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Dec 31, 2019 at 4:43 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I still believe that one shared memory element for every
non-mapped relation is not only too-complex but also too-much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all works fine but partitioned flags (that is, relations or files
corresponds to the same hash value share one flag) can reduce the
shared memory elements to a fixed small number.
There is one potentially interesting case that doesn't require any
kind of shared cache invalidation AFAICS. XLogReadBufferExtended()
calls smgrnblocks() for every buffer access, even if the buffer is
already in our buffer pool. I tried to make yet another quick
experiment-grade patch to cache the size [1], this time for use in
recovery only.
initdb -D pgdata
postgres -D pgdata -c checkpoint_timeout=60min
In another shell:
pgbench -i -s100 postgres
pgbench -M prepared -T60 postgres
killall -9 postgres
mv pgdata pgdata-save
Master branch:
cp -r pgdata-save pgdata
strace -c -f postgres -D pgdata
[... wait for "redo done", then hit ^C ...]
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
...
18.61 22.492286 26 849396 lseek
6.95 8.404369 30 277134 pwrite64
6.63 8.009679 28 277892 pread64
0.50 0.604037 39 15169 sync_file_range
...
Patched:
rm -fr pgdata
cp -r pgdata-save pgdata
strace -c -f ~/install/bin/postgres -D pgdata
[... wait for "redo done", then hit ^C ...]
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
...
16.33 8.097631 29 277134 pwrite64
15.56 7.715052 27 277892 pread64
1.13 0.559648 39 14137 sync_file_range
...
0.00 0.001505 25 59 lseek
Note: I'm still not sure how much lseek impacts performance.
It doesn't seem great that we are effectively making system calls for
most WAL records we replay, but, sadly, in this case the patch didn't
really make any measurable difference when run without strace on this
Linux VM. I suspect there is some workload and stack where it would
make a difference (CF the read(postmaster pipe) call for every WAL
record that was removed), but this is just something I noticed in
passing while working on something else, so I haven't investigated
much.
[1]: https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks (just a test, unfinished, probably has bugs)
Hi,
On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
There is one potentially interesting case that doesn't require any
kind of shared cache invalidation AFAICS. XLogReadBufferExtended()
calls smgrnblocks() for every buffer access, even if the buffer is
already in our buffer pool.
Yea, that's really quite bad*. The bit about doing so even when already
in the buffer pool is particularly absurd. Needing to have special
handling in mdcreate() for XLogReadBufferExtended() always calling it is
also fairly ugly.
It doesn't seem great that we are effectively making system calls for
most WAL records we replay, but, sadly, in this case the patch didn't
really make any measurable difference when run without strace on this
Linux VM. I suspect there is some workload and stack where it would
make a difference (CF the read(postmaster pipe) call for every WAL
record that was removed), but this is just something I noticed in
passing while working on something else, so I haven't investigated
much.
I wonder if that's just because your workload is too significantly
bottlenecked elsewhere:
postgres -D pgdata -c checkpoint_timeout=60min
In another shell:
pgbench -i -s100 postgres
pgbench -M prepared -T60 postgres
killall -9 postgres
mv pgdata pgdata-save
With scale 100, but the default shared_buffers, you'll frequently hit
the OS for reads/writes. Which will require the same metadata in the
kernel, but then also memcpys between kernel and userspace.
A word of caution about strace's -c: In my experience the total time
measurements are very imprecise somehow. I think it might be that some
of the overhead of ptracing will be attributed to the syscalls or such,
which means frequent syscalls appear relatively more expensive than they
really are.
Greetings,
Andres Freund
* it insults my sense of aesthetics
On Tue, Feb 4, 2020 at 2:23 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
There is one potentially interesting case that doesn't require any
kind of shared cache invalidation AFAICS. XLogReadBufferExtended()
calls smgrnblocks() for every buffer access, even if the buffer is
already in our buffer pool.
Yea, that's really quite bad*. The bit about doing so even when already
in the buffer pool is particularly absurd. Needing to have special
handling in mdcreate() for XLogReadBufferExtended() always calling it is
also fairly ugly.
Yeah. It seems like there are several things to fix there. So now
I'm wondering if we should start out by trying to cache the size in
the smgr layer for recovery only, like in the attached, and then later
try to extend the scheme to cover the shared case as discussed at the
beginning of the thread.
A word of caution about strace's -c: In my experience the total time
measurements are very imprecise somehow. I think it might be that some
of the overhead of ptracing will be attributed to the syscalls or such,
which means frequent syscalls appear relatively more expensive than they
really are.
Yeah, those times are large, meaningless tracing overheads. While
some systems might in fact be happy replaying a couple of million
lseeks per second, (1) I'm pretty sure some systems would not be happy
about that (see claims in this thread) and (2) it means you can't
practically use strace on recovery because it slows it down to a
crawl, which is annoying.
Attachments:
0001-Use-cached-smgrnblocks-results-in-recovery.patch
From 860472f3ca4b55ecdb885b3676e82e12592e03de Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Dec 2019 13:22:49 +1300
Subject: [PATCH] Use cached smgrnblocks() results in recovery.
Avoid repeatedly calling lseek(SEEK_END) during recovery by caching
the size of each fork. For now, we can't use the same technique
during normal running, because we lack a shared invalidation mechanism
when relations grow.
Do this by generalizing the pre-existing caching used by FSM and VM
to support all forks.
Author: Thomas Munro
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
contrib/pg_visibility/pg_visibility.c | 2 +-
src/backend/access/heap/visibilitymap.c | 18 ++++-----
src/backend/catalog/storage.c | 4 +-
src/backend/storage/freespace/freespace.c | 27 +++++++------
src/backend/storage/smgr/smgr.c | 49 +++++++++++++++++------
src/include/storage/smgr.h | 12 +++---
6 files changed, 66 insertions(+), 46 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 0cd1160ceb..e7982c8d84 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,7 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40..b1072183bc 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -561,17 +561,16 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* If we haven't cached the size of the visibility map fork yet, check it
* first.
*/
- if (rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber)
+ if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
{
if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- rel->rd_smgr->smgr_vm_nblocks = smgrnblocks(rel->rd_smgr,
- VISIBILITYMAP_FORKNUM);
+ smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
else
- rel->rd_smgr->smgr_vm_nblocks = 0;
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_vm_nblocks)
+ if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -641,11 +640,13 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Create the file first if it doesn't exist. If smgr_vm_nblocks is
* positive then it must exist, no need for an smgrexists call.
*/
- if ((rel->rd_smgr->smgr_vm_nblocks == 0 ||
- rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber) &&
+ if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ /* Invalidate cache so that smgrnblocks() asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
@@ -667,8 +668,5 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
*/
CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
- /* Update local cache with the up-to-date size */
- rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
-
UnlockRelationForExtension(rel, ExclusiveLock);
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..75b8aa7675 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -242,8 +242,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
- rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index a5083db02b..419b98eff4 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -539,18 +539,19 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* value might be stale. (We send smgr inval messages on truncation, but
* not on extension.)
*/
- if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
+ blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
{
+ /* Invalidate the cache so smgrnblocks asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
- FSM_FORKNUM);
+ smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
else
- rel->rd_smgr->smgr_fsm_nblocks = 0;
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -619,14 +620,17 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
RelationOpenSmgr(rel);
/*
- * Create the FSM file first if it doesn't exist. If smgr_fsm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
+ * Create the FSM file first if it doesn't exist. If
+ * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no need
+ * for an smgrexists call.
*/
- if ((rel->rd_smgr->smgr_fsm_nblocks == 0 ||
- rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber) &&
+ if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
+ /* Invalidate cache so that smgrnblocks() asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
@@ -638,9 +642,6 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
fsm_nblocks_now++;
}
- /* Update local cache with the up-to-date size */
- rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
-
UnlockRelationForExtension(rel, ExclusiveLock);
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..0995e1c4e9 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -17,6 +17,7 @@
*/
#include "postgres.h"
+#include "access/xlog.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -174,8 +175,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- reln->smgr_fsm_nblocks = InvalidBlockNumber;
- reln->smgr_vm_nblocks = InvalidBlockNumber;
+ for (int i = 0 ; i <= MAX_FORKNUM; ++i)
+ reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -485,6 +486,16 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
+
+ /*
+ * Normally we expect this to increase nblocks by one, but if the
+ * cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + 1;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
/*
@@ -554,7 +565,20 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
- return smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+ BlockNumber result;
+
+ /*
+ * For now, we only use cached values in recovery due to lack of shared
+ * invalidation for extensions.
+ */
+ if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+ return reln->smgr_cached_nblocks[forknum];
+
+ result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+
+ reln->smgr_cached_nblocks[forknum] = result;
+
+ return result;
}
/*
@@ -593,20 +617,19 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
+ /* Make the cached size invalid if we encounter an error. */
+ reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
+
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
/*
- * We might as well update the local smgr_fsm_nblocks and
- * smgr_vm_nblocks settings. The smgr cache inval message that
- * this function sent will cause other backends to invalidate
- * their copies of smgr_fsm_nblocks and smgr_vm_nblocks,
- * and these ones too at the next command boundary.
- * But these ensure they aren't outright wrong until then.
+ * We might as well update the local smgr_cached_nblocks. The smgr
+ * cache inval message that this function sent will cause other
+ * backends to invalidate their copies, and these ones too at the next
+ * command boundary. But these ensure they aren't outright wrong until
+ * then.
*/
- if (forknum[i] == FSM_FORKNUM)
- reln->smgr_fsm_nblocks = nblocks[i];
- if (forknum[i] == VISIBILITYMAP_FORKNUM)
- reln->smgr_vm_nblocks = nblocks[i];
+ reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
}
}
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..64771d51ff 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -45,15 +45,13 @@ typedef struct SMgrRelationData
struct SMgrRelationData **smgr_owner;
/*
- * These next three fields are not actually used or manipulated by smgr,
- * except that they are reset to InvalidBlockNumber upon a cache flush
- * event (in particular, upon truncation of the relation). Higher levels
- * store cached state here so that it will be reset when truncation
- * happens. In all three cases, InvalidBlockNumber means "unknown".
+ * The following fields are reset to InvalidBlockNumber upon a cache flush
+ * event, and hold the last known size for each fork. This information is
+ * currently only reliable during recovery, since there is no cache
+ * invalidation for fork extension.
*/
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
- BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
+ BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
--
2.20.1
On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
... (1) I'm pretty sure some systems would not be happy
about that (see claims in this thread) ...
I poked a couple of people off-list and learned that, although the
Linux and FreeBSD systems I tried could do a million lseek(SEEK_END)
calls in 60-100ms, a couple of different Windows systems took between
1.1 and 3.4 seconds (worse times when running as non-administrator),
which seems to be clearly in the territory that can put a dent in your
recovery speeds on that OS. I also learned that GetFileSizeEx() is
"only" about twice as fast, which is useful information for that other
thread about which syscall to use for this, but it's kind of
irrelevant to this thread about how we can get rid of these crazy
syscalls altogether.
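For anyone who wants to reproduce that comparison, here is a minimal
stand-alone sketch of the kind of loop involved (my reconstruction rather
than the exact harness used for the numbers above; it assumes POSIX
clock_gettime() and an existing test file, and a Windows variant would
call GetFileSizeEx() instead):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	int			fd = open(path, O_RDONLY);
	struct timespec t0, t1;
	off_t		size = 0;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < 1000000; i++)
		size = lseek(fd, 0, SEEK_END);	/* probe the current EOF */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("size = %lld bytes, elapsed = %.0f ms\n",
		   (long long) size,
		   (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
	close(fd);
	return 0;
}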
On Fri, Feb 14, 2020 at 1:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
... (1) I'm pretty sure some systems would not be happy
about that (see claims in this thread) ...
I poked a couple of people off-list and learned that, although the
Linux and FreeBSD systems I tried could do a million lseek(SEEK_END)
calls in 60-100ms, a couple of different Windows systems took between
1.1 and 3.4 seconds (worse times when running as non-administrator),
which seems to be clearly in the territory that can put a dent in your
recovery speeds on that OS. I also learned that GetFileSizeEx() is
"only" about twice as fast, which is useful information for that other
thread about which syscall to use for this, but it's kind of
irrelevant to this thread about how we can get rid of these crazy
syscalls altogether.
I received a report off-list from someone who experimented with the
patch I shared earlier on this thread[1], using a crash recovery test
similar to one I showed on the WAL prefetching thread[2] (which he was
also testing, separately).
He observed that the lseek() rate in recovery was actually a
significant problem for his environment on unpatched master, showing
up as the top sampled function in perf, and by using that patch he got
(identical) crash recovery to complete in 41s instead of 65s, with a
sane looking perf (= 58% speedup). His test system was an AWS
i3.16xlarge running an unspecified version of Linux.
I think it's possible that all of the above reports can be explained
by variations in the speculative execution bug mitigations that are
enabled by default on different systems, but I haven't tried to
investigate that yet.
[1]: https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks
[2]: /messages/by-id/CA+hUKG+OcWr8nHqa3=ZoPTGgdDcrgjSC4R2sT+jrUunBua3rpg@mail.gmail.com
On Sat, Apr 11, 2020 at 4:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I received a report off-list from someone who experimented with the
patch I shared earlier on this thread[1], using a crash recovery test
similar to one I showed on the WAL prefetching thread[2] (which he was
also testing, separately).
Rebased. I'll add this to the open commitfest.
Attachments:
v2-0001-Cache-smgrnblocks-results-in-recovery.patch (text/x-patch)
From 03662f6a546329b58d41897d3b8096d5794eacdc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 31 Dec 2019 13:22:49 +1300
Subject: [PATCH v2] Cache smgrnblocks() results in recovery.
Avoid repeatedly calling lseek(SEEK_END) during recovery by caching
the size of each fork. For now, we can't use the same technique in
other processes (for example: query planning and seq scans are two
sources of high frequency lseek(SEEK_END) calls), because we lack a
shared invalidation mechanism.
Do this by generalizing the pre-existing caching used by FSM and VM
to support all forks.
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
contrib/pg_visibility/pg_visibility.c | 2 +-
src/backend/access/heap/visibilitymap.c | 18 ++++-----
src/backend/catalog/storage.c | 4 +-
src/backend/storage/freespace/freespace.c | 27 +++++++------
src/backend/storage/smgr/smgr.c | 49 +++++++++++++++++------
src/include/storage/smgr.h | 12 +++---
6 files changed, 66 insertions(+), 46 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e..e731161734 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,7 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40..b1072183bc 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -561,17 +561,16 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* If we haven't cached the size of the visibility map fork yet, check it
* first.
*/
- if (rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber)
+ if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
{
if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- rel->rd_smgr->smgr_vm_nblocks = smgrnblocks(rel->rd_smgr,
- VISIBILITYMAP_FORKNUM);
+ smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
else
- rel->rd_smgr->smgr_vm_nblocks = 0;
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_vm_nblocks)
+ if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -641,11 +640,13 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Create the file first if it doesn't exist. If smgr_vm_nblocks is
* positive then it must exist, no need for an smgrexists call.
*/
- if ((rel->rd_smgr->smgr_vm_nblocks == 0 ||
- rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber) &&
+ if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ /* Invalidate cache so that smgrnblocks() asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
@@ -667,8 +668,5 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
*/
CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
- /* Update local cache with the up-to-date size */
- rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
-
UnlockRelationForExtension(rel, ExclusiveLock);
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ec143b640a..9e6e6c42d3 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -290,8 +290,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
- rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 95a21f6cc3..6a96126b0c 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -541,18 +541,19 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* value might be stale. (We send smgr inval messages on truncation, but
* not on extension.)
*/
- if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
+ blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
{
+ /* Invalidate the cache so smgrnblocks asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
- FSM_FORKNUM);
+ smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
else
- rel->rd_smgr->smgr_fsm_nblocks = 0;
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -621,14 +622,17 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
RelationOpenSmgr(rel);
/*
- * Create the FSM file first if it doesn't exist. If smgr_fsm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
+ * Create the FSM file first if it doesn't exist. If
+ * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
+ * need for an smgrexists call.
*/
- if ((rel->rd_smgr->smgr_fsm_nblocks == 0 ||
- rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber) &&
+ if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
+ /* Invalidate cache so that smgrnblocks() asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
@@ -640,9 +644,6 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
fsm_nblocks_now++;
}
- /* Update local cache with the up-to-date size */
- rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
-
UnlockRelationForExtension(rel, ExclusiveLock);
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7d667c6586..d60cc3ab61 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -17,6 +17,7 @@
*/
#include "postgres.h"
+#include "access/xlog.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -174,8 +175,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- reln->smgr_fsm_nblocks = InvalidBlockNumber;
- reln->smgr_vm_nblocks = InvalidBlockNumber;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -464,6 +465,16 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
+
+ /*
+ * Normally we expect this to increase nblocks by one, but if the cached
+ * value isn't as expected, just invalidate it so the next call asks the
+ * kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + 1;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
/*
@@ -537,7 +548,20 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
- return smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+ BlockNumber result;
+
+ /*
+ * For now, we only use cached values in recovery due to lack of shared
+ * invalidation for extensions.
+ */
+ if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+ return reln->smgr_cached_nblocks[forknum];
+
+ result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+
+ reln->smgr_cached_nblocks[forknum] = result;
+
+ return result;
}
/*
@@ -576,20 +600,19 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
+ /* Make sure the cached size is invalid if we encounter an error. */
+ reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
+
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
/*
- * We might as well update the local smgr_fsm_nblocks and
- * smgr_vm_nblocks settings. The smgr cache inval message that this
- * function sent will cause other backends to invalidate their copies
- * of smgr_fsm_nblocks and smgr_vm_nblocks, and these ones too at the
- * next command boundary. But these ensure they aren't outright wrong
- * until then.
+ * We might as well update the local smgr_cached_nblocks values. The
+ * smgr cache inval message that this function sent will cause other
+ * backends to invalidate their copies of smgr_fsm_nblocks and
+ * smgr_vm_nblocks, and these ones too at the next command boundary.
+ * But these ensure they aren't outright wrong until then.
*/
- if (forknum[i] == FSM_FORKNUM)
- reln->smgr_fsm_nblocks = nblocks[i];
- if (forknum[i] == VISIBILITYMAP_FORKNUM)
- reln->smgr_vm_nblocks = nblocks[i];
+ reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
}
}
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 6566659593..f28a842401 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -45,15 +45,13 @@ typedef struct SMgrRelationData
struct SMgrRelationData **smgr_owner;
/*
- * These next three fields are not actually used or manipulated by smgr,
- * except that they are reset to InvalidBlockNumber upon a cache flush
- * event (in particular, upon truncation of the relation). Higher levels
- * store cached state here so that it will be reset when truncation
- * happens. In all three cases, InvalidBlockNumber means "unknown".
+ * The following fields are reset to InvalidBlockNumber upon a cache flush
+ * event, and hold the last known size for each fork. This information is
+ * currently only reliable during recovery, since there is no cache
+ * invalidation for fork extension.
*/
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
- BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
+ BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
--
2.20.1
On Sat, Jun 20, 2020 at 10:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Rebased. I'll add this to the open commitfest.
I traced the recovery process while running pgbench -M prepared -c16
-j16 -t10000 (= 160,000 transactions). With the patch, the number of
lseeks went from 1,080,661 (6.75 per pgbench transaction) to just 85.
I went ahead and pushed this patch.
There's still the matter of crazy numbers of lseeks in regular
backends; looking at all processes while running the above test, I get
1,469,060 (9.18 per pgbench transaction) without -M prepared, and
193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
with this approach will require bullet-proof shared invalidation, but
I think it's doable, in later work.
On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
There's still the matter of crazy numbers of lseeks in regular
backends; looking at all processes while running the above test, I get
1,469,060 (9.18 per pgbench transaction) without -M prepared, and
193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
with this approach will require bullet-proof shared invalidation, but
I think it's doable, in later work.
I couldn't help hacking on this a bit. Perhaps instead of
bullet-proof general shared invalidation, we should have a way for
localised bits of code to state that they are ok with a "relaxed"
value. Then they should explain the theory for why that is safe in
each case based on arguments about memory barrier pairs, but leave all
other callers making the system call so that we don't torpedo the
whole project by making it too hard. For now, the main cases I'm
interested in are the ones that show up all the time as the dominant
system call in various workloads:
(1) Sequential scan relation-size probe. This should be OK with a
relaxed value. You can't miss the invalidation for a truncation,
because the read barrier in your lock acquisition on the relation
pairs with the write barrier in the exclusive lock release of any
session that truncated, and you can't miss any relation
extension that your snapshot can see, because the release of the
extension lock pairs with the lock involved in snapshot acquisition.
(2) Planner relation-size probe, which should be OK with a relaxed
value. Similar arguments give you a fresh enough view, I think.
Or maybe there is a theory along these lines that already covers every
use of smgrnblocks(), so a separate mode isn't required... I don't
know!
The attached sketch-quality patch shows some of the required elements
working, though I ran out of steam trying to figure out how to thread
this thing through the right API layers so for now it always asks for
a relaxed value in table_block_relation_size().
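To make the relaxed mode concrete, a call site that opts in would look
roughly like this (heap_scan_nblocks() is an invented name used purely for
illustration; in the attached patch only table_block_relation_size()
passes the flag so far):

/*
 * Hypothetical caller that is happy with a possibly stale ("relaxed")
 * size: it already holds at least AccessShareLock on the relation, whose
 * acquisition included a memory barrier, so it cannot miss a truncation,
 * and it does not care about extensions newer than its snapshot.
 */
static BlockNumber
heap_scan_nblocks(Relation rel)
{
	RelationOpenSmgr(rel);
	return smgrnblocks(rel->rd_smgr, MAIN_FORKNUM, SMGRNBLOCKS_RELAXED);
}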
Attachments:
0001-WIP-Cache-smgrnblocks-in-more-cases.patch (text/x-patch)
From 0392adb59be052bb5d04a4b2e0d151cefdd810e4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 1 Aug 2020 09:30:41 +1200
Subject: [PATCH] WIP: Cache smgrnblocks() in more cases.
This is just early sketch code and may be completely wrong...
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
src/backend/access/gist/gistbuild.c | 2 +-
src/backend/access/heap/visibilitymap.c | 6 +-
src/backend/access/table/tableam.c | 4 +-
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/catalog/storage.c | 4 +-
src/backend/storage/buffer/bufmgr.c | 4 +-
src/backend/storage/freespace/freespace.c | 6 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/smgr/smgr.c | 85 +++++++++++++++++++----
src/include/storage/smgr.h | 11 ++-
10 files changed, 97 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 671b5e9186..d5baa8d21a 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -495,7 +495,7 @@ gistBuildCallback(Relation index,
*/
if ((buildstate->bufferingMode == GIST_BUFFERING_AUTO &&
buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0 &&
- effective_cache_size < smgrnblocks(index->rd_smgr, MAIN_FORKNUM)) ||
+ effective_cache_size < smgrnblocks(index->rd_smgr, MAIN_FORKNUM, 0)) ||
(buildstate->bufferingMode == GIST_BUFFERING_STATS &&
buildstate->indtuples >= BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET))
{
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bc..242b4a183f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -528,7 +528,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
else
newnblocks = truncBlock;
- if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) <= newnblocks)
+ if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM, 0) <= newnblocks)
{
/* nothing to do, the file was already smaller than requested size */
return InvalidBlockNumber;
@@ -564,7 +564,7 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
{
if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+ smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM, 0);
else
rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
@@ -647,7 +647,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Invalidate cache so that smgrnblocks() asks the kernel. */
rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
- vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+ vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM, 0);
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 3afb63b1fe..e8ead8f019 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -628,10 +628,10 @@ table_block_relation_size(Relation rel, ForkNumber forkNumber)
if (forkNumber == InvalidForkNumber)
{
for (int i = 0; i < MAX_FORKNUM; i++)
- nblocks += smgrnblocks(rel->rd_smgr, i);
+ nblocks += smgrnblocks(rel->rd_smgr, i, 0);
}
else
- nblocks = smgrnblocks(rel->rd_smgr, forkNumber);
+ nblocks = smgrnblocks(rel->rd_smgr, forkNumber, SMGRNBLOCKS_RELAXED);
return nblocks * BLCKSZ;
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b2ca0cd4cf..c7768a7595 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -461,7 +461,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
*/
smgrcreate(smgr, forknum, true);
- lastblock = smgrnblocks(smgr, forknum);
+ lastblock = smgrnblocks(smgr, forknum, 0);
if (blkno < lastblock)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9e6e6c42d3..5e359ff5f2 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -434,7 +434,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
use_wal = XLogIsNeeded() &&
(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
- nblocks = smgrnblocks(src, forkNum);
+ nblocks = smgrnblocks(src, forkNum, 0);
for (blkno = 0; blkno < nblocks; blkno++)
{
@@ -721,7 +721,7 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
{
if (smgrexists(srel, fork))
{
- BlockNumber n = smgrnblocks(srel, fork);
+ BlockNumber n = smgrnblocks(srel, fork, 0);
/* we shouldn't come here for unlogged relations */
Assert(fork != INIT_FORKNUM);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f1ae6f9f84..35806e60ea 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -739,7 +739,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Substitute proper block number if caller asked for P_NEW */
if (isExtend)
- blockNum = smgrnblocks(smgr, forkNum);
+ blockNum = smgrnblocks(smgr, forkNum, 0);
if (isLocalBuf)
{
@@ -2862,7 +2862,7 @@ RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum)
/* Open it at the smgr level if not already done */
RelationOpenSmgr(relation);
- return smgrnblocks(relation->rd_smgr, forkNum);
+ return smgrnblocks(relation->rd_smgr, forkNum, 0);
case RELKIND_RELATION:
case RELKIND_TOASTVALUE:
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..e4d93b4794 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -317,7 +317,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
else
{
new_nfsmblocks = fsm_logical_to_physical(first_removed_address);
- if (smgrnblocks(rel->rd_smgr, FSM_FORKNUM) <= new_nfsmblocks)
+ if (smgrnblocks(rel->rd_smgr, FSM_FORKNUM, 0) <= new_nfsmblocks)
return InvalidBlockNumber; /* nothing to do; the FSM was already
* smaller */
}
@@ -547,7 +547,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
/* Invalidate the cache so smgrnblocks asks the kernel. */
rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ smgrnblocks(rel->rd_smgr, FSM_FORKNUM, 0);
else
rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
@@ -633,7 +633,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Invalidate cache so that smgrnblocks() asks the kernel. */
rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM, 0);
while (fsm_nblocks_now < fsm_nblocks)
{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..1182d3f0df 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -120,6 +120,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, SpinlockSemaSize());
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
+ size = add_size(size, smgr_shmem_estimate_size());
size = add_size(size, dsm_estimate_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
@@ -220,6 +221,7 @@ CreateSharedMemoryAndSemaphores(void)
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
+ smgr_shmem_init();
InitBufferPool();
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..218826e2b1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -22,10 +22,17 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
+#include "storage/shmem.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+/* Control object in shared memory. */
+typedef struct SMgrShared
+{
+ pg_atomic_uint64 nblocks_inval[1024];
+} SMgrShared;
+
/*
* This struct of function pointers defines the API between smgr.c and
@@ -98,6 +105,50 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+static SMgrShared *smgr_shared;
+
+size_t
+smgr_shmem_estimate_size(void)
+{
+ return sizeof(*smgr_shared);
+}
+
+void
+smgr_shmem_init(void)
+{
+ bool found;
+
+ smgr_shared = ShmemInitStruct("SMgr", sizeof(*smgr_shared), &found);
+ if (!found)
+ {
+ for (size_t i = 0; i < lengthof(smgr_shared->nblocks_inval); ++i)
+ pg_atomic_init_u64(&smgr_shared->nblocks_inval[i], 0);
+ }
+}
+
+/*
+ * Read the invalidation counter that a given relation maps to.
+ */
+static uint64
+smgrnblocks_get_inval(SMgrRelation reln)
+{
+ size_t slot;
+
+ slot = reln->smgr_rnode.node.relNode % lengthof(smgr_shared->nblocks_inval);
+ return pg_atomic_read_u64(&smgr_shared->nblocks_inval[slot]);
+}
+
+/*
+ * Increment the invalidation counter that a given relation maps to.
+ */
+static uint64
+smgrnblocks_inc_inval(SMgrRelation reln)
+{
+ size_t slot;
+
+ slot = reln->smgr_rnode.node.relNode % lengthof(smgr_shared->nblocks_inval);
+ return pg_atomic_fetch_add_u64(&smgr_shared->nblocks_inval[slot], 1) + 1;
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -466,15 +517,8 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
- /*
- * Normally we expect this to increase nblocks by one, but if the cached
- * value isn't as expected, just invalidate it so the next call asks the
- * kernel.
- */
- if (reln->smgr_cached_nblocks[forknum] == blocknum)
- reln->smgr_cached_nblocks[forknum] = blocknum + 1;
- else
- reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ reln->smgr_cached_nblocks_inval[forknum] = smgrnblocks_inc_inval(reln);
+ reln->smgr_cached_nblocks[forknum] = blocknum + 1;
}
/*
@@ -543,23 +587,35 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
/*
* smgrnblocks() -- Calculate the number of blocks in the
- * supplied relation.
+ * supplied relation, optionally returning a relaxed value
+ * that may already be out of date, but not older than the most
+ * recent read barrier that pairs with the most recent
+ * write barrier in any process that earlier changed the
+ * true value.
*/
BlockNumber
-smgrnblocks(SMgrRelation reln, ForkNumber forknum)
+smgrnblocks(SMgrRelation reln, ForkNumber forknum, int flags)
{
BlockNumber result;
+ uint64 inval;
/*
- * For now, we only use cached values in recovery due to lack of a shared
- * invalidation mechanism for changes in file size.
+ * We can use cached values in recovery since no other process can change
+ * the size of a relation. Otherwise, we only use a cached value if the
+ * caller said that a relaxed value is OK, and the shared invalidation
+ * counter hasn't moved.
*/
- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+ inval = smgrnblocks_get_inval(reln);
+ if ((InRecovery ||
+ ((flags & SMGRNBLOCKS_RELAXED) &&
+ inval == reln->smgr_cached_nblocks_inval[forknum])) &&
+ reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
return reln->smgr_cached_nblocks[forknum];
result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
reln->smgr_cached_nblocks[forknum] = result;
+ reln->smgr_cached_nblocks_inval[forknum] = inval;
return result;
}
@@ -614,6 +670,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
*/
reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
}
+ smgrnblocks_inc_inval(reln);
}
/*
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..af8a5cb874 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -14,10 +14,14 @@
#ifndef SMGR_H
#define SMGR_H
+#include "port/atomics.h"
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+/* Flags for smgrnblocks(). */
+#define SMGRNBLOCKS_RELAXED 0x01
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -52,6 +56,7 @@ typedef struct SMgrRelationData
*/
BlockNumber smgr_targblock; /* current insertion target block */
BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
+ uint64 smgr_cached_nblocks_inval[MAX_FORKNUM + 1];
/* additional public fields may someday exist here */
@@ -77,6 +82,9 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern size_t smgr_shmem_estimate_size(void);
+extern void smgr_shmem_init(void);
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -98,7 +106,8 @@ extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
+extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum,
+ int flags);
extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
--
2.20.1
On 01.08.2020 00:56, Thomas Munro wrote:
On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
There's still the matter of crazy numbers of lseeks in regular
backends; looking at all processes while running the above test, I get
1,469,060 (9.18 per pgbench transaction) without -M prepared, and
193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
with this approach will require bullet-proof shared invalidation, but
I think it's doable, in later work.
I couldn't help hacking on this a bit. Perhaps instead of
bullet-proof general shared invalidation, we should have a way for
localised bits of code to state that they are ok with a "relaxed"
value. Then they should explain the theory for why that is safe in
each case based on arguments about memory barrier pairs, but leave all
other callers making the system call so that we don't torpedo the
whole project by making it too hard. For now, the main cases I'm
interested in are the ones that show up all the time as the dominant
system call in various workloads:
(1) Sequential scan relation-size probe. This should be OK with a
relaxed value. You can't miss the invalidation for a truncation,
because the read barrier in your lock acquisition on the relation
pairs with the write barrier in the exclusive lock release of any
session that truncated, and you can't miss any relation
extension that your snapshot can see, because the release of the
extension lock pairs with the lock involved in snapshot acquisition.
(2) Planner relation-size probe, which should be OK with a relaxed
value. Similar arguments give you a fresh enough view, I think.
Or maybe there is a theory along these lines that already covers every
use of smgrnblocks(), so a separate mode isn't required... I don't
know!
The attached sketch-quality patch shows some of the required elements
working, though I ran out of steam trying to figure out how to thread
this thing through the right API layers so for now it always asks for
a relaxed value in table_block_relation_size().
So in this thread three solutions are proposed:
1. "bullet-proof general shared invalidation"
2. recovery-only solution avoiding shared memory and invalidation
3. "relaxed" shared memory cache with simplified invalidation
If solving such a very important but still specific problem as caching
relation sizes requires so much effort,
then maybe it is time to go further in the direction of a shared
catalog?
This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of a larger number of relations:
O(max_connections * n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly, access to a shared cache requires extra synchronization. But DDL
operations are relatively rare.
So in most cases we will have only shared locks. Maybe the overhead of
locking will not be too large?
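As an illustration only (every name below is invented, nothing like this
exists yet), the common read path under such a shared relation cache
could be a simple shared-lock lookup, with the exclusive lock reserved
for DDL:

#include "postgres.h"
#include "storage/block.h"
#include "storage/lwlock.h"

/* Hypothetical entry in a shared relation descriptor cache. */
typedef struct SharedRelCacheEntry
{
	Oid			relid;
	BlockNumber nblocks;
	/* ... other cached relation descriptor fields ... */
} SharedRelCacheEntry;

/* Hypothetical read path; only DDL would take the lock exclusively. */
static BlockNumber
shared_relcache_nblocks(Oid relid)
{
	SharedRelCacheEntry *entry;
	BlockNumber result = InvalidBlockNumber;

	LWLockAcquire(SharedRelCacheLock, LW_SHARED);	/* hypothetical lock */
	entry = shared_relcache_lookup(relid);			/* hypothetical lookup */
	if (entry)
		result = entry->nblocks;
	LWLockRelease(SharedRelCacheLock);

	return result;
}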
On Mon, Aug 3, 2020 at 5:54 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
Certainly, access to a shared cache requires extra synchronization. But DDL
operations are relatively rare.
Some applications use CREATE TEMP TABLE, DROP TEMP TABLE, or
CREATE TABLE AS SELECT ... very frequently.
Regards
Pavel
So in most cases we will have only shared locks. Maybe the overhead of
locking will not be too large?
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
So in this thread three solutions are proposed:
1. "bullet-proof general shared invalidation"
2. recovery-only solution avoiding shared memory and invalidation
3. "relaxed" shared memory cache with simplified invalidation
Hi Konstantin,
By the way, point 2 is now committed (c5315f4f). As for 1 vs 3, I
wasn't proposing two different invalidation techniques: in both
approaches, I'm calling the cached values "relaxed", meaning that
their freshness is controlled by memory barriers elsewhere that the
caller has to worry about. I'm just suggesting for idea 3 that it
might be a good idea to use relaxed values only in a couple of hot
code paths where we do the analysis required to convince ourselves
that memory barriers are already in the right places to make it safe.
By "bullet-proof" I meant that we could in theory convince ourselves
that *all* users of smgrnblocks() can safely use relaxed values, but
that's hard.
That said, the sketch patch I posted certainly needs more work, and
maybe someone has a better idea on how to do it.
If solving such a very important but still specific problem as caching
relation sizes requires so much effort,
then maybe it is time to go further in the direction of a shared
catalog?
I wouldn't say it requires too much effort, at least for the conservative
approach (3). But I also agree with what you're saying, in the long
term:
This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of a larger number of relations:
O(max_connections * n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly, access to a shared cache requires extra synchronization. But DDL
operations are relatively rare.
So in most cases we will have only shared locks. Maybe the overhead of
locking will not be too large?
Yeah, I would be very happy if we get high-performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...). I agree with you that our cache
memory usage is a big problem, and it will be great to fix that one
day. I don't think that should stop us from making small improvements
to the existing design in the meantime, though. "The perfect is the
enemy of the good." Look at all this trivial stuff:
https://wiki.postgresql.org/wiki/Syscall_Reduction
I don't have high quality data to report yet, but from simple tests
I'm seeing orders of magnitude fewer syscalls per pgbench transaction
in recovery, when comparing 11 to 14devel, due to various changes.
Admittedly, the size probes in regular backends aren't quite as bad as
the recovery ones were, because they're already limited by the rate
you can throw request/response cycles at the DB, but there are still
some cases like planning for high-partition counts and slow
filesystems that can benefit measurably from caching.
On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of a larger number of relations:
O(max_connections * n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly, access to a shared cache requires extra synchronization. But DDL
operations are relatively rare.
So in most cases we will have only shared locks. Maybe the overhead of
locking will not be too large?
Yeah, I would be very happy if we get high-performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...). ...
After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory. Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance. In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size. The scheme is described in
the commit message and comments. The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd. By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.
That is, it's not really a "cache". It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory. It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers. Here, "dirty"
means the size changed.
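In terms of the flags used in the sketch attached below, the write side
would do roughly the following (smgr_mark_dirty() is an invented helper;
the sync-side protocol is only described in an XXX comment in the patch,
so treat this as my reading of the intent rather than what the code does
yet):

static void
smgr_mark_dirty(SMgrSharedRelation *sr, ForkNumber forknum,
				BlockNumber nblocks)
{
	uint32		flags = smgr_lock_sr(sr);

	/* Record the new authoritative size and remember that it is unsynced. */
	sr->nblocks[forknum] = nblocks;
	flags |= SR_DIRTY(forknum) | SR_JUST_DIRTIED(forknum);
	smgr_unlock_sr(sr, flags);
}

/*
 * The sync side would then set SR_SYNCING and clear SR_JUST_DIRTIED, fsync
 * the fork, and clear SR_DIRTY only if SR_JUST_DIRTIED was not set again in
 * the meantime; eviction would refuse to reuse an entry while any SR_DIRTY
 * bit is still set, so a size the filesystem can't yet be trusted to report
 * is never lost.
 */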
Attached is an early sketch, not debugged much yet (check under
contrib/postgres_fdw fails right now for a reason I didn't look into),
and there are clearly many architectural choices one could make
differently, and more things to be done... but it seemed like enough
of a prototype to demonstrate the concept and fuel some discussion
about this and whatever better ideas people might have...
Thoughts?
[1]: /messages/by-id/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
Attachments:
0001-WIP-Track-relation-sizes-in-shared-memory.patch (text/x-patch)
From f34ac22ace4d6cee15e150357d361f73fead2755 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH] WIP: Track relation sizes in shared memory.
Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
smgr_shared_relations controls how many can exist at once, and they are
evicted as required. "Dirty" SMgrSharedRelations can only be evicted
after being synced to disk. Goals:
1. Provide fast lookups of relation sizes, cutting down on lseek()
calls. This supersedes the recovery-only caching added recently, and
replaces preexisting FSM and VM caching schemes.
2. Stop trusting the operating system to keep track of the size of
files that we have recently extended, until fsync() has been called.
XXX smgrimmedsync() is maybe too blunt an instrument?
XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...
XXX perhaps bgwriter should try to clean them?
XXX currently reusing the array of locks also used for buffer mapping,
need to define some more in lwlocks.c...
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
contrib/pg_visibility/pg_visibility.c | 1 -
src/backend/access/heap/visibilitymap.c | 28 +-
src/backend/catalog/storage.c | 2 -
src/backend/storage/freespace/freespace.c | 35 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/smgr/md.c | 10 +
src/backend/storage/smgr/smgr.c | 554 ++++++++++++++++++++--
src/backend/utils/misc/guc.c | 11 +
src/include/storage/smgr.h | 21 +-
9 files changed, 576 insertions(+), 89 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 54e47b810f..702951e487 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bc..8fe29ecae7 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -547,6 +547,7 @@ static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
Buffer buf;
+ BlockNumber nblocks;
/*
* We might not have opened the relation at the smgr level yet, or we
@@ -557,20 +558,12 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
*/
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
- }
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ nblocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -636,17 +629,10 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ /* Create the file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..45f62030aa 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -290,8 +290,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..629923c4d4 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -531,29 +531,18 @@ static Buffer
fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
{
BlockNumber blkno = fsm_logical_to_physical(addr);
+ BlockNumber nblocks;
Buffer buf;
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the FSM yet, check it first. Also
- * recheck if the requested block seems to be past end, since our cached
- * value might be stale. (We send smgr inval messages on truncation, but
- * not on extension.)
- */
- if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
- {
- /* Invalidate the cache so smgrnblocks asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
- }
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ nblocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ else
+ nblocks = 0;
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -621,18 +610,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ /* Create the FSM file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..cfc55d3691 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -121,6 +121,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size());
+ size = add_size(size, smgr_shmem_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
size = add_size(size, PredicateLockShmemSize());
@@ -212,6 +213,8 @@ CreateSharedMemoryAndSemaphores(void)
dsm_shmem_init();
+ smgr_shmem_init();
+
/*
* Set up xlog, clog, and buffers
*/
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1d4aa482cc..6228447e31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1332,6 +1332,16 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}
+ /*
+ * XXX: We could have an interface smgrbeginclean() that would return true
+ * if it has managed to set SR_SYNC and clean SR_JUST_DIRTIED, and then if
+ * so, after our sync runs we could call smgrfinishclean() with our
+ * success/failure report, which would clear SR_DIRTY if SR_JUST_DIRTIED
+ * hasn't been set in the meantime. But... how can we know if *this*
+ * segment is one that represents an extension? SR_DIRTY is just
+ * interested in syncing extended blocks.
+ */
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..31f895019b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,13 +19,66 @@
#include "access/xlog.h"
#include "lib/ilist.h"
+#include "port/atomics.h"
+#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
+#include "storage/shmem.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+/*
+ * An entry in the hash table that allows us to look up objects in the
+ * SMgrSharedRelation pool by rnode (+ backend).
+ */
+typedef struct SMgrSharedRelationMapping
+{
+ RelFileNodeBackend rnode;
+ int index;
+} SMgrSharedRelationMapping;
+
+/*
+ * An object in shared memory tracks the size of the forks of a relation.
+ */
+struct SMgrSharedRelation
+{
+ RelFileNodeBackend rnode;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ pg_atomic_uint32 flags;
+ pg_atomic_uint64 generation; /* mapping change */
+};
+
+/* For now, we borrow the buffer managers array of locks. XXX fixme */
+#define SR_PARTITIONS NUM_BUFFER_PARTITIONS
+#define SR_PARTITION_LOCK(hash) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET].lock)
+
+/* Flags. */
+#define SR_LOCKED 0x01
+#define SR_VALID 0x02
+
+/* Each forknum gets its own dirty, syncing and just dirtied bits. */
+#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
+#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
+#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
+
+/* Masks to test if any forknum is currently dirty or syncing. */
+#define SR_SYNCING_MASK (((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1)))
+#define SR_DIRTY_MASK (((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1)))
+
+/* Extract the lowest dirty forknum from flags (there must be at least one). */
+#define SR_GET_ONE_DIRTY(mask) pg_rightmost_one_pos32((((mask) >> 2) & (SR_DIRTY_MASK >> 2)))
+
+typedef struct SMgrSharedRelationPool
+{
+ pg_atomic_uint32 next;
+ SMgrSharedRelation objects[FLEXIBLE_ARRAY_MEMBER];
+} SMgrSharedRelationPool;
+
+static SMgrSharedRelationPool *sr_pool;
+static HTAB *sr_mapping_table;
/*
* This struct of function pointers defines the API between smgr.c and
@@ -98,6 +151,454 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/* GUCs. */
+int smgr_shared_relations = 1000;
+
+/*
+ * Try to get the size of a relation's fork without locking.
+ */
+static BlockNumber
+smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelation *sr = reln->smgr_shared;
+ BlockNumber result;
+
+ if (sr)
+ {
+ pg_read_barrier();
+
+ /* We can load int-sized values atomically without special measures. */
+ Assert(sizeof(sr->nblocks[forknum]) == sizeof(uint32));
+ result = sr->nblocks[forknum];
+
+ /*
+ * With a read barrier between the loads, we can check that the object
+ * still refers to the same rnode before trusting the answer.
+ */
+ pg_read_barrier();
+ if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ return result;
+
+ /*
+ * The generation doesn't match, the shared relation must have been
+ * evicted since we got a pointer to it. We'll need to do more work.
+ */
+ }
+
+ return InvalidBlockNumber;
+}
+
+/*
+ * Try to get the size of a relation's fork by looking it up in the mapping
+ * table with a share lock. This will succeed if a shared entry
+ * (SMgrSharedRelation) already exists.
+ */
+static BlockNumber
+smgrnblocks_slow(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ BlockNumber result = InvalidBlockNumber;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ result = sr->nblocks[forknum];
+
+ /* We can take the fast path until this SR is eventually evicted. */
+ reln->smgr_shared = sr;
+ reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
+ }
+ LWLockRelease(mapping_lock);
+
+ return result;
+}
+
+/*
+ * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
+ * only a few instructions. The return value is the current set of flags,
+ * which may be modified and then passed to smgr_unlock_sr() to be applied
+ * atomically when the lock is released.
+ */
+static uint32
+smgr_lock_sr(SMgrSharedRelation *sr)
+{
+
+ for (;;)
+ {
+ uint32 old_flags = pg_atomic_read_u32(&sr->flags);
+ uint32 flags;
+
+ if (!(old_flags & SR_LOCKED))
+ {
+ flags = old_flags | SR_LOCKED;
+ if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
+ return flags;
+ }
+ }
+ return 0; /* unreachable */
+}
+
+/*
+ * Unlock a SMgrSharedRelation, atomically updating its flags at the same
+ * time.
+ */
+static void
+smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
+{
+ pg_write_barrier();
+ pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
+}
+
+/*
+ * Allocate a new invalid SMgrSharedRelation, and return it locked.
+ *
+ * The replacement algorithm is a simple FIFO design with no second chance for
+ * now.
+ */
+static SMgrSharedRelation *
+smgr_alloc_sr(void)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 index;
+ LWLock *mapping_lock;
+ uint32 flags;
+ RelFileNodeBackend rnode;
+ uint32 hash;
+
+ retry:
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+
+ /* If it's unused, can return it, still locked, immediately. */
+ if (!(flags & SR_VALID))
+ return sr;
+
+ /*
+ * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
+ * locks, but we need to do it in that order, so we'll unlock the SR
+ * first.
+ */
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ hash = get_hash_value(sr_mapping_table, &rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (!mapping || mapping->index != index)
+ {
+ /* Too slow, it's gone or now points somewhere else. Go around. */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* We will lock the SR for just a few instructions. */
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ /*
+ * If another backend is currently syncing any fork, we aren't allowed to
+ * evict it, and waiting for it would be pointless because that other
+ * backend already plans to allocate it. So go around.
+ */
+ if (flags & SR_SYNCING_MASK)
+ {
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /*
+ * We will sync every fork that is dirty, and then we'll try to
+ * evict it.
+ */
+ while (flags & SR_DIRTY_MASK)
+ {
+ SMgrRelation reln;
+ ForkNumber forknum = SR_GET_ONE_DIRTY(flags);
+
+ /* Set the sync bit, clear the just-dirtied bit and unlock. */
+ flags |= SR_SYNCING(forknum);
+ flags &= ~SR_JUST_DIRTIED(forknum);
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * Perform the I/O, with no locks held.
+ * XXX It sucks that we fsync every segment, not just the ones that need it...
+ */
+ reln = smgropen(rnode.node, rnode.backend);
+ smgrimmedsync(reln, forknum);
+
+ /*
+ * Reacquire the locks. The object can't have been evicted,
+ * because we set a sync bit.
+ * XXX And what if it's dropped?
+ */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_SYNCING(forknum));
+ if (flags & SR_JUST_DIRTIED(forknum))
+ {
+ /*
+ * Someone else dirtied it while we were syncing, so we can't mark
+ * it clean. Let's give up on this SR and go around again.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* This fork is clean! */
+ flags &= ~SR_SYNCING(forknum);
+ flags &= ~SR_DIRTY(forknum);
+ }
+
+ /*
+ * If we made it this far, there are no dirty forks, so we're now allowed
+ * to evict the SR from the pool and the mapping table. Make sure that
+ * smgrnblocks_fast() sees that its pointer is now invalid by bumping the
+ * generation.
+ */
+ flags &= ~SR_VALID;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
+ smgr_unlock_sr(sr, flags);
+
+ /* Remove from the mapping table. */
+ hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * XXX: We unlock while doing HASH_REMOVE on principle. Maybe it'd be OK
+ * to hold it now that the clock hand is far away and there is no way
+ * anyone can look up this SR through buffer mapping table.
+ */
+ flags = smgr_lock_sr(sr);
+ if (flags & SR_VALID)
+ {
+ /* Oops, someone else got it. */
+ smgr_unlock_sr(sr, flags);
+ goto retry;
+ }
+
+ return sr;
+}
+
+/*
+ * Set the number of blocks in a relation, in shared memory, and optionally
+ * also mark the relation as "dirty" (meaning that it must be fsync'd before it
+ * can be evicted).
+ */
+static void
+smgrnblocks_update(SMgrRelation reln,
+ ForkNumber forknum,
+ BlockNumber nblocks,
+ bool mark_dirty)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr = NULL;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ retry:
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ flags = smgr_lock_sr(sr);
+ if (mark_dirty)
+ {
+ /*
+ * Extend and truncate clobber the value, and there are no races
+ * to worry about because they can have higher level exclusive
+ * locking on the relation.
+ */
+ sr->nblocks[forknum] = nblocks;
+
+ /*
+ * Mark it dirty, and if it's currently being sync'd, make sure it
+ * stays dirty after that completes.
+ */
+ flags |= SR_DIRTY(forknum);
+ if (flags & SR_SYNCING(forknum))
+ flags |= SR_JUST_DIRTIED(forknum);
+ }
+ else if (!(flags & SR_DIRTY(forknum)))
+ {
+ /*
+ * We won't clobber a dirty value with a non-dirty update, to
+ * avoid races against concurrent extend/truncate, but we can
+ * install a new clean value.
+ */
+ sr->nblocks[forknum] = nblocks;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+ LWLockRelease(mapping_lock);
+
+ /* If we didn't find it, then we'll need to allocate one. */
+ if (!sr)
+ {
+ bool found;
+
+ sr = smgr_alloc_sr();
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_ENTER,
+ &found);
+ if (!found)
+ {
+ /* Success! Initialize. */
+ mapping->index = sr - sr_pool->objects;
+ smgr_unlock_sr(sr, SR_VALID);
+ sr->rnode = reln->smgr_rnode;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ sr->nblocks[i] = InvalidBlockNumber;
+ LWLockRelease(mapping_lock);
+ }
+ else
+ {
+ /* Someone beat us to it. Go around again. */
+ smgr_unlock_sr(sr, 0); /* = not valid */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+ }
+}
+
+static void
+smgr_drop_sr(SMgrRelation reln)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+
+ retry:
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ if (flags & SR_SYNCING_MASK)
+ {
+ /*
+ * Oops, someone's syncing one of its forks; nothing to do but
+ * wait.
+ */
+ smgr_unlock_sr(sr, flags);
+ pg_usleep(100000);
+ goto retry;
+ }
+
+ /* Mark it invalid and drop the mapping. */
+ smgr_unlock_sr(sr, ~SR_VALID);
+ hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ }
+ LWLockRelease(mapping_lock);
+}
+
+size_t
+smgr_shmem_size(void)
+{
+ size_t size = 0;
+
+ size = add_size(size,
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations);
+ size = add_size(size,
+ hash_estimate_size(smgr_shared_relations,
+ sizeof(SMgrSharedRelationMapping)));
+
+ return size;
+}
+
+void
+smgr_shmem_init(void)
+{
+ HASHCTL info;
+ bool found;
+
+ info.keysize = sizeof(RelFileNodeBackend);
+ info.entrysize = sizeof(SMgrSharedRelationMapping);
+ info.num_partitions = SR_PARTITIONS;
+ sr_mapping_table = ShmemInitHash("SMgrSharedRelation Mapping Table",
+ smgr_shared_relations,
+ smgr_shared_relations,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+
+ sr_pool = ShmemInitStruct("SMgrSharedRelation Pool",
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations,
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u32(&sr_pool->next, 0);
+ for (uint32 i = 0; i < smgr_shared_relations; ++i)
+ {
+ pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
+ }
+ }
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -175,8 +676,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
+ reln->smgr_shared = NULL;
+ reln->smgr_shared_generation = 0;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -247,6 +748,9 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
+ if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
+ return true;
+
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
}
@@ -430,6 +934,9 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
for (i = 0; i < nrels; i++)
CacheInvalidateSmgr(rnodes[i]);
+ for (i = 0; i < nrels; i++)
+ smgr_drop_sr(rels[i]);
+
/*
* Delete the physical file(s).
*
@@ -465,16 +972,7 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
-
- /*
- * Normally we expect this to increase nblocks by one, but if the cached
- * value isn't as expected, just invalidate it so the next call asks the
- * kernel.
- */
- if (reln->smgr_cached_nblocks[forknum] == blocknum)
- reln->smgr_cached_nblocks[forknum] = blocknum + 1;
- else
- reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ smgrnblocks_update(reln, forknum, blocknum + 1, true);
}
/*
@@ -550,16 +1048,21 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
- /*
- * For now, we only use cached values in recovery due to lack of a shared
- * invalidation mechanism for changes in file size.
- */
- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
- return reln->smgr_cached_nblocks[forknum];
+ /* Can we get the answer from shared memory without locking? */
+ result = smgrnblocks_fast(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+
+ /* Can we get the answer from shared memory with only a share lock? */
+ result = smgrnblocks_slow(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+ /* Ask the kernel. */
result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
- reln->smgr_cached_nblocks[forknum] = result;
+ /* Update the value in shared memory for faster service next time. */
+ smgrnblocks_update(reln, forknum, result, false);
return result;
}
@@ -600,19 +1103,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
- /* Make the cached size is invalid if we encounter an error. */
- reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
-
- /*
- * We might as well update the local smgr_cached_nblocks values. The
- * smgr cache inval message that this function sent will cause other
- * backends to invalidate their copies of smgr_fsm_nblocks and
- * smgr_vm_nblocks, and these ones too at the next command boundary.
- * But these ensure they aren't outright wrong until then.
- */
- reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+ smgrnblocks_update(reln, forknum[i], nblocks[i], true);
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..175ac555e6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
#include "storage/proc.h"
+#include "storage/smgr.h"
#include "storage/standby.h"
#include "tcop/tcopprot.h"
#include "tsearch/ts_cache.h"
@@ -2445,6 +2446,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_shared_relations", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared relation objects in memory at one time."),
+ NULL
+ },
+ &smgr_shared_relations,
+ 1000, 64, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..b2cab478a2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,13 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/* GUCs. */
+extern int smgr_shared_relations;
+
+/* Definition private to smgr.c. */
+struct SMgrSharedRelation;
+typedef struct SMgrSharedRelation SMgrSharedRelation;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -44,14 +51,11 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
- /*
- * The following fields are reset to InvalidBlockNumber upon a cache flush
- * event, and hold the last known size for each fork. This information is
- * currently only reliable during recovery, since there is no cache
- * invalidation for fork extension.
- */
+ /* pointer to shared object, valid if non-NULL and generation matches */
+ SMgrSharedRelation *smgr_shared;
+ uint64 smgr_shared_generation;
+
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
@@ -77,6 +81,9 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern size_t smgr_shmem_size(void);
+extern void smgr_shmem_init(void);
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
--
2.20.1
On 16.11.2020 10:11, Thomas Munro wrote:
On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations
O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly access to shared cache requires extra synchronization. But DDL
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of
locking will not be too large?
Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...). ...
After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory. Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance. In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size. The scheme is described in
the commit message and comments. The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd. By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.
That is, it's not really a "cache". It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory. It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers. Here, "dirty"
means the size changed.
I will look at your implementation more precisely later.
But now I just want to ask one question: is it reasonable to spend a
substantial amount of effort trying to solve the problem of redundant
calculations of relation size, when it seems to be just one part of a
more general problem: a shared relation cache?
It is a well-known problem that private caches of the system catalog can
cause unacceptable memory consumption for a large number of backends.
If there are thousands of relations in the database (which is not such a
rare case, especially taking into account ORMs and temporary tables),
then the size of the catalog cache can be several megabytes. If that is
multiplied by a thousand backends, we get gigabytes of memory used just
for private catalog caches. Thanks to Andres's optimizations of snapshot
taking, Postgres can now comfortably handle thousands of connections, so
this extremely inefficient use of memory for private catalog caches
becomes more and more critical.
Also, private caches require a sophisticated and error-prone invalidation
mechanism.
If we had such a shared cache, then keeping the shared relation size there
would be a trivial task, wouldn't it?
So maybe, before "diving" into the problem of maintaining a shared pool of
dirtied "inodes", we should discuss and concentrate on solving the more
generic problem?
On 16.11.2020 10:11, Thomas Munro wrote:
On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations
O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly access to shared cache requires extra synchronization. But DDL
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of
locking will not be too large?
Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...). ...
After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory. Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance. In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size. The scheme is described in
the commit message and comments. The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd. By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.
That is, it's not really a "cache". It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory. It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers. Here, "dirty"
means the size changed.
Attached is an early sketch, not debugged much yet (check under
contrib/postgres_fdw fails right now for a reason I didn't look into),
and there are clearly many architectural choices one could make
differently, and more things to be done... but it seemed like enough
of a prototype to demonstrate the concept and fuel some discussion
about this and whatever better ideas people might have...
Thoughts?
[1] /messages/by-id/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
I noticed that there are several fragments like this:
if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
I wonder if it would be more efficient, and would simplify the code, to add
a "create_if_not_exists" parameter to smgrnblocks?
It would avoid an extra hash lookup and avoid explicit checks for fork
presence in multiple places.
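For illustration, here is a rough sketch of the intended semantics (the helper
name is made up; the real proposal would fold the flag into smgrnblocks()
itself so that the existence check and the size lookup share one hash probe):

static BlockNumber
smgrnblocks_create_if_not_exists(SMgrRelation reln, ForkNumber forknum)
{
    /* Create the fork first if it doesn't exist yet. */
    if (!smgrexists(reln, forknum))
        smgrcreate(reln, forknum, false);

    return smgrnblocks(reln, forknum);
}

Call sites like the fragment above would then collapse to a single call.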
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
I will look at your implementation more precisely later.
Thanks! Warning: I thought about making a thing like this for a
while, but the patch itself is only a one-day prototype, so I am sure
you can find many bugs... Hmm, I guess the smgrnblocks_fast() code
really needs a nice big comment to explain the theory about why this
value is fresh enough, based on earlier ideas about implicit memory
barriers. (Something like: if you're extending, you must have
acquired the extension lock; if you're scanning you must have acquired
a snapshot that only needs to see blocks that existed at that time and
concurrent truncations are impossible; I'm wondering about special
snapshots like VACUUM, and whether this is too relaxed for that, or if
it's covered by existing horizon-based interlocking...).
But now I just to ask one question: is it reasonable to spent
substantial amount of efforts trying to solve
the problem of redundant calculations of relation size, which seems to
be just part of more general problem: shared relation cache?
It's a good question. I don't know exactly how to make a shared
relation cache (I have only some vague ideas and read some of
Ideriha-san's work) but I agree that it would be very nice. However,
I think that's an independent and higher level thing, because Relation
!= SMgrRelation.
What is an SMgrRelation? Not much! It holds file descriptors, which
are different in every backend so you can't share them, and then some
information about sizes and a target block for insertions. With this
patch, the sizes become shared and more reliable, so there's *almost*
nothing left at all except for the file descriptors and the pointer to
the shared object. I haven't looked into smgr_targblock, the last
remaining non-shared thing, but there might be an opportunity to do
something clever for parallel inserters (the proposed feature) by
coordinating target blocks in SMgrSharedRelation; I don't know.
It is well known problem: private caches of system catalog can cause
unacceptable memory consumption for large number of backends.
Yeah.
Also private caches requires sophisticated and error prone invalidation
mechanism.If we will have such shared cache, then keeping shared relation size
seems to be trivial task, isn't it?
So may be we before "diving" into the problem of maintaining shared pool
of dirtied "inodes" we can better discuss and conerntrate on solving
more generic problem?
Sure, we can talk about how to make a shared relation cache (and
syscache etc). I think it'll be much harder than this shared SMGR
concept. Even if we figure out how to share palloc'd trees of nodes
with correct invalidation and interlocking (ie like Ideriha-san's work
in [1]), including good performance and space management/replacement,
I guess we'll still need per-backend Relation objects to point to the
per-backend SMgrRelation objects to hold the per-backend file
descriptors. (Until we have threads...). But... do you think sharing
SMGR relations to the maximum extent possible conflicts with that?
Are you thinking there should be some general component here, perhaps
a generic system for pools of shmem objects with mapping tables and a
replacement policy?
[1]: /messages/by-id/4E72940DA2BF16479384A86D54D0988A567B9245@G01JPEXMBKW04
From: Thomas Munro <thomas.munro@gmail.com>
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
I will look at your implementation more precisely later.
Thanks! Warning: I thought about making a thing like this for a
while, but the patch itself is only a one-day prototype, so I am sure
you can find many bugs... Hmm, I guess the smgrnblocks_fast() code
really needs a nice big comment to explain the theory about why this
value is fresh enough, based on earlier ideas about implicit memory
barriers. (Something like: if you're extending, you must have
acquired the extension lock; if you're scanning you must have acquired
a snapshot that only needs to see blocks that existed at that time and
concurrent truncations are impossible; I'm wondering about special
snapshots like VACUUM, and whether this is too relaxed for that, or if
it's covered by existing horizon-based interlocking...).
We'd like to join the discussion and review, too, so that Kirk's optimization for VACUUM truncation and TRUNCATE can extend beyond recovery to normal operation. In that regard, would you mind becoming a committer for his patch? We think it's committable now. I'm fine with using the unmodified smgrnblocks() as your devil's advocate suggestion.
It's a good question. I don't know exactly how to make a shared
relation cache (I have only some vague ideas and read some of
Ideriha-san's work) but I agree that it would be very nice. However,
I think that's an independent and higher level thing, because Relation
!= SMgrRelation.
Sure, we can talk about how to make a shared relation cache (and
syscache etc). I think it'll be much harder than this shared SMGR
concept. Even if we figure out how to share palloc'd trees of nodes
with correct invalidation and interlocking (ie like Ideriha-san's work
in [1]), including good performance and space management/replacement,
I guess we'll still need per-backend Relation objects to point to the
per-backend SMgrRelation objects to hold the per-backend file
descriptors. (Until we have threads...). But... do you think sharing
SMGR relations to the maximum extent possible conflicts with that?
Are you thinking there should be some general component here, perhaps
a generic system for pools of shmem objects with mapping tables and a
replacement policy?
Yes, Ideriha-san and we have completed the feature, called global metacache, for placing relation and catalog caches in shared memory and limiting their sizes, and have incorporated it in our product to meet our customer's request. I hope we'll submit the patch in the near future when our resources permit.
Regards
Takayuki Tsunakawa
On Tuesday, November 17, 2020 9:40 AM, Tsunakawa-san wrote:
From: Thomas Munro <thomas.munro@gmail.com>
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
I will look at your implementation more precisely later.
Thanks! Warning: I thought about making a thing like this for a
while, but the patch itself is only a one-day prototype, so I am sure
you can find many bugs... Hmm, I guess the smgrnblocks_fast() code
really needs a nice big comment to explain the theory about why this
value is fresh enough, based on earlier ideas about implicit memory
barriers. (Something like: if you're extending, you must have
acquired the extension lock; if you're scanning you must have acquired
a snapshot that only needs to see blocks that existed at that time and
concurrent truncations are impossible; I'm wondering about special
snapshots like VACUUM, and whether this is too relaxed for that, or if
it's covered by existing horizon-based interlocking...).
We'd like to join the discussion and review, too, so that Kirk's optimization for
VACUUM truncation and TRUNCATE can extend beyond recovery to
operations during normal operation. In that regard, would you mind
becoming a committer for his patch? We think it's committable now. I'm
fine with using the unmodified smgrnblocks() as your devil's suggestion.
Just saw this thread has been bumped. And yes, regarding the vacuum and truncate
optimization during recovery, it would still work even without the unmodified smgrnblocks.
We'd just need to remove the conditions that check the flag parameter. And if we explore
your latest prototype patch, it can possibly use the new smgrnblocks_xxxx APIs introduced here.
That said, we'd like to explore the next step of extending the feature to normal operations.
Regards,
Kirk Jamison
At Mon, 16 Nov 2020 20:11:52 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory. Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance. In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size. The scheme is described in
the commit message and comments. The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd. By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.
That is, it's not really a "cache". It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory. It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers. Here, "dirty"
means the size changed.
Attached is an early sketch, not debugged much yet (check under
contrib/postgres_fdw fails right now for a reason I didn't look into),
and there are clearly many architectural choices one could make
differently, and more things to be done... but it seemed like enough
of a prototype to demonstrate the concept and fuel some discussion
about this and whatever better ideas people might have...
Thoughts?
[1] /messages/by-id/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
I was naively thinking that we could just remember the size of all
database files in a shared array, but I realized that that needs a hash
table to translate rnodes into array indexes, which could grow huge.
The proposed way makes sure that the next lseek call will tell the truth
before a cached value is forgotten. On the other hand, a shared smgrrel
allows us to defer fsyncing of a (local) smgrrel until someone evicts
the entry. That seems to me to be the minimal mechanism that lets us
know the right file size at all times, while getting rid of the need to
remember the size of every database file in shared memory. I'm afraid
it might cause fsync storms on very large systems, though.
However, if we try to do a similar thing in any other way, it seems that
it will grow into something more like a pg_smgr catalog, and that seems
likely to eat more memory and cause invalidation storms.
Sorry for the rambling thought, but I think this is basically in the
right direction.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Nov 17, 2020 at 4:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
I will look at your implementation more precisely later.
Thanks! Warning: I thought about making a thing like this for a
while, but the patch itself is only a one-day prototype, so I am sure
you can find many bugs... Hmm, I guess the smgrnblocks_fast() code
really needs a nice big comment to explain the theory about why this
value is fresh enough, based on earlier ideas about implicit memory
barriers. (Something like: if you're extending, you must have
acquired the extension lock; if you're scanning you must have acquired
a snapshot that only needs to see blocks that existed at that time and
concurrent truncations are impossible; I'm wondering about special
snapshots like VACUUM, and whether this is too relaxed for that, or if
it's covered by existing horizon-based interlocking...).
Yeah, it is good to verify the VACUUM stuff, but I have another question
here. What about queries having functions that access the same
relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here that the
function accesses the relation t1)? Now, here I think that because the
relation 't1' is already open, it might use the same block count from
the shared cache even though the snapshot for relation 't1' when
accessed from func() might be different. Am I missing something, or is
this dealt with in some way?
--
With Regards,
Amit Kapila.
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, it is good to verify VACUUM stuff but I have another question
here. What about queries having functions that access the same
relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
access the relation t1)? Now, here I think because the relation 't1'
is already opened, it might use the same value of blocks from the
shared cache even though the snapshot for relation 't1' when accessed
from func() might be different. Am, I missing something, or is it
dealt in some way?
I think it should be covered by the theory about implicit memory
barriers and snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
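To spell that rule out, here is a minimal sketch of the per-fork update
logic (a simplified form of smgrnblocks_update() from the attached patch;
assume the caller holds the SMgrSharedRelation's flag spinlock and has read
*flags under it):

static void
sr_update_nblocks(SMgrSharedRelation *sr, ForkNumber forknum,
                  BlockNumber nblocks, bool mark_dirty, uint32 *flags)
{
    if (mark_dirty)
    {
        /* Extend/truncate: serialised by higher-level locks, so clobber. */
        sr->nblocks[forknum] = nblocks;
        *flags |= SR_DIRTY(forknum);
        /* If a sync is in progress, make sure the entry stays dirty. */
        if (*flags & SR_SYNCING(forknum))
            *flags |= SR_JUST_DIRTIED(forknum);
    }
    else if (!(*flags & SR_DIRTY(forknum)))
    {
        /* A clean (lseek-derived) value must never clobber a dirty one. */
        sr->nblocks[forknum] = nblocks;
    }
}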
Attachments:
v2-0001-WIP-Track-relation-sizes-in-shared-memory.patchtext/x-patch; charset=US-ASCII; name=v2-0001-WIP-Track-relation-sizes-in-shared-memory.patchDownload
From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory.
Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
smgr_shared_relations controls how many can exist at once, and they are
evicted as required. "Dirty" SMgrSharedRelations can only be evicted
after being synced to disk. Goals:
1. Provide faster lookups of relation sizes, cutting down on lseek()
calls. This supersedes the recovery-only caching added recently, and
replaces preexisting FSM and VM caching schemes.
2. Stop trusting the operating system to keep track of the size of
files that we have recently extended, until fsync() has been called.
XXX smgrimmedsync() is maybe too blunt an instrument?
XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...
XXX perhaps bgwriter should try to clean them?
XXX currently reusing the array of locks also used for buffer mapping,
need to define some more in lwlocks.c...
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
contrib/pg_visibility/pg_visibility.c | 1 -
src/backend/access/heap/visibilitymap.c | 28 +-
src/backend/catalog/storage.c | 2 -
src/backend/storage/freespace/freespace.c | 35 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/smgr/md.c | 10 +
src/backend/storage/smgr/smgr.c | 500 ++++++++++++++++++++--
src/backend/utils/misc/guc.c | 11 +
src/include/storage/smgr.h | 17 +-
9 files changed, 518 insertions(+), 89 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 54e47b810f..702951e487 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bc..8fe29ecae7 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -547,6 +547,7 @@ static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
Buffer buf;
+ BlockNumber nblocks;
/*
* We might not have opened the relation at the smgr level yet, or we
@@ -557,20 +558,12 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
*/
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
- }
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ nblocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -636,17 +629,10 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ /* * Create the file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..45f62030aa 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -290,8 +290,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..629923c4d4 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -531,29 +531,18 @@ static Buffer
fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
{
BlockNumber blkno = fsm_logical_to_physical(addr);
+ BlockNumber nblocks;
Buffer buf;
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the FSM yet, check it first. Also
- * recheck if the requested block seems to be past end, since our cached
- * value might be stale. (We send smgr inval messages on truncation, but
- * not on extension.)
- */
- if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
- {
- /* Invalidate the cache so smgrnblocks asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
- }
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ nblocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ else
+ nblocks = 0;
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -621,18 +610,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ /* Create the FSM file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..cfc55d3691 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -121,6 +121,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size());
+ size = add_size(size, smgr_shmem_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
size = add_size(size, PredicateLockShmemSize());
@@ -212,6 +213,8 @@ CreateSharedMemoryAndSemaphores(void)
dsm_shmem_init();
+ smgr_shmem_init();
+
/*
* Set up xlog, clog, and buffers
*/
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1d4aa482cc..6228447e31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1332,6 +1332,16 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}
+ /*
+ * XXX: We could have an interface smgrbeginclean() that would return true
+ * if it has managed to set SR_SYNCING and clear SR_JUST_DIRTIED, and then if
+ * so, after our sync runs we could call smgrfinishclean() with our
+ * success/failure report, which would clear SR_DIRTY if SR_JUST_DIRTIED
+ * hasn't been set in the meantime. But... how can we know if *this*
+ * segment is one that represents an extension? SR_DIRTY is just
+ * interested in syncing extended blocks.
+ */
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..b5dc370bba 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,13 +19,65 @@
#include "access/xlog.h"
#include "lib/ilist.h"
+#include "port/atomics.h"
+#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
+#include "storage/shmem.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+/*
+ * An entry in the hash table that allows us to look up objects in the
+ * SMgrSharedRelation pool by rnode (+ backend).
+ */
+typedef struct SMgrSharedRelationMapping
+{
+ RelFileNodeBackend rnode;
+ int index;
+} SMgrSharedRelationMapping;
+
+/*
+ * An object in shared memory tracks the size of the forks of a relation.
+ */
+struct SMgrSharedRelation
+{
+ RelFileNodeBackend rnode;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ pg_atomic_uint32 flags;
+};
+
+/* For now, we borrow the buffer manager's array of locks. XXX fixme */
+#define SR_PARTITIONS NUM_BUFFER_PARTITIONS
+#define SR_PARTITION_LOCK(hash) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET].lock)
+
+/* Flags. */
+#define SR_LOCKED 0x01
+#define SR_VALID 0x02
+
+/* Each forknum gets its own dirty, syncing and just dirtied bits. */
+#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
+#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
+#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
+
+/* Masks to test if any forknum is currently dirty or syncing. */
+#define SR_SYNCING_MASK (((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1)))
+#define SR_DIRTY_MASK (((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1)))
+
+/* Extract the lowest dirty forknum from flags (there must be at least one). */
+#define SR_GET_ONE_DIRTY(mask) pg_rightmost_one_pos32((((mask) >> 2) & (SR_DIRTY_MASK >> 2)))
+
+typedef struct SMgrSharedRelationPool
+{
+ pg_atomic_uint32 next;
+ SMgrSharedRelation objects[FLEXIBLE_ARRAY_MEMBER];
+} SMgrSharedRelationPool;
+
+static SMgrSharedRelationPool *sr_pool;
+static HTAB *sr_mapping_table;
/*
* This struct of function pointers defines the API between smgr.c and
@@ -98,6 +150,408 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/* GUCs. */
+int smgr_shared_relations = 1000;
+
+/*
+ * Try to get the size of a relation's fork by looking it up in the mapping
+ * table with a shared lock. This will succeed if a shared entry
+ * (SMgrSharedRelation) already exists.
+ */
+static BlockNumber
+smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ BlockNumber result = InvalidBlockNumber;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ result = sr->nblocks[forknum];
+ }
+ LWLockRelease(mapping_lock);
+
+ return result;
+}
+
+/*
+ * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
+ * only a few instructions. The return value is the current set of flags,
+ * which may be modified and then passed to smgr_unlock_sr() to be applied
+ * atomically when the lock is released.
+ */
+static uint32
+smgr_lock_sr(SMgrSharedRelation *sr)
+{
+
+ for (;;)
+ {
+ uint32 old_flags = pg_atomic_read_u32(&sr->flags);
+ uint32 flags;
+
+ if (!(old_flags & SR_LOCKED))
+ {
+ flags = old_flags | SR_LOCKED;
+ if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
+ return flags;
+ }
+ }
+ return 0; /* unreachable */
+}
+
+/*
+ * Unlock a SMgrSharedRelation, atomically updating its flags at the same
+ * time.
+ */
+static void
+smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
+{
+ pg_write_barrier();
+ pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
+}
+
+/*
+ * Allocate a new invalid SMgrSharedRelation, and return it locked.
+ *
+ * The replacement algorithm is a simple FIFO design with no second chance for
+ * now.
+ */
+static SMgrSharedRelation *
+smgr_alloc_sr(void)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 index;
+ LWLock *mapping_lock;
+ uint32 flags;
+ RelFileNodeBackend rnode;
+ uint32 hash;
+
+ retry:
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+
+ /* If it's unused, can return it, still locked, immediately. */
+ if (!(flags & SR_VALID))
+ return sr;
+
+ /*
+ * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
+ * locks, but we need to do it in that order, so we'll unlock the SR
+ * first.
+ */
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ hash = get_hash_value(sr_mapping_table, &rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (!mapping || mapping->index != index)
+ {
+ /* Too slow, it's gone or now points somewhere else. Go around. */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* We will lock the SR for just a few instructions. */
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ /*
+ * If another backend is currently syncing any fork, we aren't allowed to
+ * evict it, and waiting for it would be pointless because that other
+ * backend already plans to allocate it. So go around.
+ */
+ if (flags & SR_SYNCING_MASK)
+ {
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /*
+ * We will sync every fork that is dirty, and then we'll try to
+ * evict it.
+ */
+ while (flags & SR_DIRTY_MASK)
+ {
+ SMgrRelation reln;
+ ForkNumber forknum = SR_GET_ONE_DIRTY(flags);
+
+ /* Set the sync bit, clear the just-dirtied bit and unlock. */
+ flags |= SR_SYNCING(forknum);
+ flags &= ~SR_JUST_DIRTIED(forknum);
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * Perform the I/O, with no locks held.
+ * XXX It sucks that we fsync every segment, not just the ones that need it...
+ */
+ reln = smgropen(rnode.node, rnode.backend);
+ smgrimmedsync(reln, forknum);
+
+ /*
+ * Reacquire the locks. The object can't have been evicted,
+ * because we set a sync bit.
+ * XXX And what if it's dropped?
+ */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_SYNCING(forknum));
+ if (flags & SR_JUST_DIRTIED(forknum))
+ {
+ /*
+ * Someone else dirtied it while we were syncing, so we can't mark
+ * it clean. Let's give up on this SR and go around again.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* This fork is clean! */
+ flags &= ~SR_SYNCING(forknum);
+ flags &= ~SR_DIRTY(forknum);
+ }
+
+ /*
+ * If we made it this far, there are no dirty forks, so we're now allowed
+ * to evict the SR from the pool and the mapping table.
+ */
+ flags &= ~SR_VALID;
+ smgr_unlock_sr(sr, flags);
+
+ /* Remove from the mapping table. */
+ hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * XXX: We unlock while doing HASH_REMOVE on principle. Maybe it'd be OK
+ * to hold it now that the clock hand is far away and there is no way
+ * anyone can look up this SR through buffer mapping table.
+ */
+ flags = smgr_lock_sr(sr);
+ if (flags & SR_VALID)
+ {
+ /* Oops, someone else got it. */
+ smgr_unlock_sr(sr, flags);
+ goto retry;
+ }
+
+ return sr;
+}
+
+/*
+ * Set the number of blocks in a relation, in shared memory, and optionally
+ * also mark the relation as "dirty" (meaning that it must be fsync'd before it
+ * can be evicted).
+ */
+static void
+smgrnblocks_update(SMgrRelation reln,
+ ForkNumber forknum,
+ BlockNumber nblocks,
+ bool mark_dirty)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr = NULL;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ retry:
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ flags = smgr_lock_sr(sr);
+ if (mark_dirty)
+ {
+ /*
+ * Extend and truncate clobber the value, and there are no races
+ * to worry about because they can have higher level exclusive
+ * locking on the relation.
+ */
+ sr->nblocks[forknum] = nblocks;
+
+ /*
+ * Mark it dirty, and if it's currently being sync'd, make sure it
+ * stays dirty after that completes.
+ */
+ flags |= SR_DIRTY(forknum);
+ if (flags & SR_SYNCING(forknum))
+ flags |= SR_JUST_DIRTIED(forknum);
+ }
+ else if (!(flags & SR_DIRTY(forknum)))
+ {
+ /*
+ * We won't clobber a dirty value with a non-dirty update, to
+ * avoid races against concurrent extend/truncate, but we can
+ * install a new clean value.
+ */
+ sr->nblocks[forknum] = nblocks;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+ LWLockRelease(mapping_lock);
+
+ /* If we didn't find it, then we'll need to allocate one. */
+ if (!sr)
+ {
+ bool found;
+
+ sr = smgr_alloc_sr();
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_ENTER,
+ &found);
+ if (!found)
+ {
+ /* Success! Initialize. */
+ mapping->index = sr - sr_pool->objects;
+ smgr_unlock_sr(sr, SR_VALID);
+ sr->rnode = reln->smgr_rnode;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ sr->nblocks[i] = InvalidBlockNumber;
+ LWLockRelease(mapping_lock);
+ }
+ else
+ {
+ /* Someone beat us to it. Go around again. */
+ smgr_unlock_sr(sr, 0); /* = not valid */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+ }
+}
+
+static void
+smgr_drop_sr(SMgrRelation reln)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+
+ retry:
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ if (flags & SR_SYNCING_MASK)
+ {
+ /*
+ * Oops, someone's syncing one of its forks; nothing to do but
+ * wait.
+ */
+ smgr_unlock_sr(sr, flags);
+ pg_usleep(100000);
+ goto retry;
+ }
+
+ /* Mark it invalid and drop the mapping. */
+ smgr_unlock_sr(sr, ~SR_VALID);
+ hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ }
+ LWLockRelease(mapping_lock);
+}
+
+size_t
+smgr_shmem_size(void)
+{
+ size_t size = 0;
+
+ size = add_size(size,
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations);
+ size = add_size(size,
+ hash_estimate_size(smgr_shared_relations,
+ sizeof(SMgrSharedRelationMapping)));
+
+ return size;
+}
+
+void
+smgr_shmem_init(void)
+{
+ HASHCTL info;
+ bool found;
+
+ info.keysize = sizeof(RelFileNodeBackend);
+ info.entrysize = sizeof(SMgrSharedRelationMapping);
+ info.num_partitions = SR_PARTITIONS;
+ sr_mapping_table = ShmemInitHash("SMgrSharedRelation Mapping Table",
+ smgr_shared_relations,
+ smgr_shared_relations,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+
+ sr_pool = ShmemInitStruct("SMgrSharedRelation Pool",
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations,
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u32(&sr_pool->next, 0);
+ for (uint32 i = 0; i < smgr_shared_relations; ++i)
+ {
+ pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ }
+ }
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -175,8 +629,6 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -247,6 +699,9 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
+ if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ return true;
+
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
}
@@ -430,6 +885,9 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
for (i = 0; i < nrels; i++)
CacheInvalidateSmgr(rnodes[i]);
+ for (i = 0; i < nrels; i++)
+ smgr_drop_sr(rels[i]);
+
/*
* Delete the physical file(s).
*
@@ -465,16 +923,7 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
-
- /*
- * Normally we expect this to increase nblocks by one, but if the cached
- * value isn't as expected, just invalidate it so the next call asks the
- * kernel.
- */
- if (reln->smgr_cached_nblocks[forknum] == blocknum)
- reln->smgr_cached_nblocks[forknum] = blocknum + 1;
- else
- reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ smgrnblocks_update(reln, forknum, blocknum + 1, true);
}
/*
@@ -550,16 +999,16 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
- /*
- * For now, we only use cached values in recovery due to lack of a shared
- * invalidation mechanism for changes in file size.
- */
- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
- return reln->smgr_cached_nblocks[forknum];
+ /* Can we get the answer from shared memory with only a share lock? */
+ result = smgrnblocks_shared(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+ /* Ask the kernel. */
result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
- reln->smgr_cached_nblocks[forknum] = result;
+ /* Update the value in shared memory for faster service next time. */
+ smgrnblocks_update(reln, forknum, result, false);
return result;
}
@@ -600,19 +1049,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
- /* Make the cached size is invalid if we encounter an error. */
- reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
-
- /*
- * We might as well update the local smgr_cached_nblocks values. The
- * smgr cache inval message that this function sent will cause other
- * backends to invalidate their copies of smgr_fsm_nblocks and
- * smgr_vm_nblocks, and these ones too at the next command boundary.
- * But these ensure they aren't outright wrong until then.
- */
- reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+ smgrnblocks_update(reln, forknum[i], nblocks[i], true);
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..175ac555e6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
#include "storage/proc.h"
+#include "storage/smgr.h"
#include "storage/standby.h"
#include "tcop/tcopprot.h"
#include "tsearch/ts_cache.h"
@@ -2445,6 +2446,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_shared_relations", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared relation objects in memory at one time."),
+ NULL
+ },
+ &smgr_shared_relations,
+ 1000, 64, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..2eea69213b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,13 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/* GUCs. */
+extern int smgr_shared_relations;
+
+/* Definition private to smgr.c. */
+struct SMgrSharedRelation;
+typedef struct SMgrSharedRelation SMgrSharedRelation;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -44,14 +51,7 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
- /*
- * The following fields are reset to InvalidBlockNumber upon a cache flush
- * event, and hold the last known size for each fork. This information is
- * currently only reliable during recovery, since there is no cache
- * invalidation for fork extension.
- */
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
@@ -77,6 +77,9 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern size_t smgr_shmem_size(void);
+extern void smgr_shmem_init(void);
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
--
2.20.1
Attachment: v2-0002-WIP-Provide-a-lock-free-fast-path-for-smgrnblocks.patch (text/x-patch)
From eaf9efd4df49e62b991c7c7347613f64c6269336 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 19 Nov 2020 17:09:51 +1300
Subject: [PATCH v2 2/2] WIP: Provide a lock-free fast path for smgrnblocks().
SMgrRelation objects gain a pointer to the last known SMgrSharedRelation
object. There are three concurrency hazards to worry about:
1. The SMgrSharedRelation pointed to could be evicted at any time. We
record a generation number, and insert memory barriers so that we can
detect that and fall back to a slower path.
2. The nblocks value is read without any locking, which is atomic
because it is a 32 bit value, and PostgreSQL requires atomic 32 bit
reads generally.
3. The nblocks value must be free enough for scans, extension,
truncation and dropping buffers, because those operations all
executed memory barriers when acquiring a snapshot to scan (which
doesn't need to see blocks added after that) or an exclusive heavyweight
lock to extend, truncate or drop. XXX right?
XXX That's the idea anyway, but this is just a sketch; almost certainly
incomplet and inkorrect.
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
src/backend/storage/smgr/smgr.c | 58 +++++++++++++++++++++++++++++++--
src/include/storage/smgr.h | 4 +++
2 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b5dc370bba..c96fa46785 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -48,6 +48,7 @@ struct SMgrSharedRelation
RelFileNodeBackend rnode;
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
+ pg_atomic_uint64 generation; /* mapping change */
};
/* For now, we borrow the buffer managers array of locks. XXX fixme */
@@ -153,6 +154,40 @@ static void smgrshutdown(int code, Datum arg);
/* GUCs. */
int smgr_shared_relations = 1000;
+/*
+ * Try to get the size of a relation's fork without locking.
+ */
+static BlockNumber
+smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelation *sr = reln->smgr_shared;
+ BlockNumber result;
+
+ if (sr)
+ {
+ pg_read_barrier();
+
+ /* We can load int-sized values atomically without special measures. */
+ Assert(sizeof(sr->nblocks[forknum]) == sizeof(uint32));
+ result = sr->nblocks[forknum];
+
+ /*
+ * With a read barrier between the loads, we can check that the object
+ * still refers to the same rnode before trusting the answer.
+ */
+ pg_read_barrier();
+ if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ return result;
+
+ /*
+ * The generation doesn't match, the shared relation must have been
+ * evicted since we got a pointer to it. We'll need to do more work.
+ */
+ }
+
+ return InvalidBlockNumber;
+}
+
/*
* Try to get the size of a relation's fork by looking it up in the mapping
* table with a shared lock. This will succeed if the SMgrRelation already
@@ -180,6 +215,10 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
{
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+
+ /* We can take the fast path until this SR is eventually evicted. */
+ reln->smgr_shared = sr;
+ reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
}
LWLockRelease(mapping_lock);
@@ -337,9 +376,14 @@ smgr_alloc_sr(void)
/*
* If we made it this far, there are no dirty forks, so we're now allowed
- * to evict the SR from the pool and the mapping table.
+ * to evict the SR from the pool and the mapping table. Make sure that
+ * smgrnblocks_fast() sees that its pointer is now invalid by bumping the
+ * generation.
*/
flags &= ~SR_VALID;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
smgr_unlock_sr(sr, flags);
/* Remove from the mapping table. */
@@ -447,6 +491,8 @@ smgrnblocks_update(SMgrRelation reln,
mapping->index = sr - sr_pool->objects;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
for (int i = 0; i <= MAX_FORKNUM; ++i)
sr->nblocks[i] = InvalidBlockNumber;
LWLockRelease(mapping_lock);
@@ -549,6 +595,7 @@ smgr_shmem_init(void)
for (uint32 i = 0; i < smgr_shared_relations; ++i)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
}
}
}
@@ -629,6 +676,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
+ reln->smgr_shared = NULL;
+ reln->smgr_shared_generation = 0;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
- if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
return true;
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
@@ -999,6 +1048,11 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
+ /* Can we get the answer from shared memory without locking? */
+ result = smgrnblocks_fast(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+
/* Can we get the answer from shared memory with only a share lock? */
result = smgrnblocks_shared(reln, forknum);
if (result != InvalidBlockNumber)
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 2eea69213b..b2cab478a2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -51,6 +51,10 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
+ /* pointer to shared object, valid if non-NULL and generation matches */
+ SMgrSharedRelation *smgr_shared;
+ uint64 smgr_shared_generation;
+
BlockNumber smgr_targblock; /* current insertion target block */
/* additional public fields may someday exist here */
--
2.20.1
On Thu, Nov 19, 2020 at 1:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
Yeah, it is good to verify VACUUM stuff but I have another question
here. What about queries having functions that access the same
relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
access the relation t1)? Now, here I think because the relation 't1'
is already opened, it might use the same value of blocks from the
shared cache even though the snapshot for relation 't1' when accessed
from func() might be different. Am, I missing something, or is it
dealt in some way?
I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
Hi Thomas:
Thank you for working on it.
I spent one day studying the patch and I want to talk about one question
for now.
What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
--
Best Regards
Andy Fan
Hi Andy,
On Thu, Dec 17, 2020 at 7:29 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
I spent one day studying the patch and I want to talk about one question for now.
What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
The underlying theory of the patch is that after fsync() succeeds, the
new size is permanent, but before that it could change at any time
because of asynchronous activities in the kernel*. Therefore, if you
want to evict the size from shared memory, you have to fsync() the
file first. I used smgrimmedsync() to do that.
*I suspect that it can really only shrink with network file systems
such as NFS and GlusterFS, where the kernel has to take liberties to
avoid network round trips for every file extension, but I don't know
the details on all 9 of our supported OSes and standards aren't very
helpful for understanding buffered I/O.
Hi Thomas,
Thank you for your quick response.
On Thu, Dec 17, 2020 at 3:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Hi Andy,
On Thu, Dec 17, 2020 at 7:29 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
I spent one day studying the patch and I want to talk about one question for now.
What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
The underlying theory of the patch is that after fsync() succeeds, the
new size is permanent, but before that it could change at any time
because of asynchronous activities in the kernel*. Therefore, if you
want to evict the size from shared memory, you have to fsync() the
file first. I used smgrimmedsync() to do that.
*I suspect that it can really only shrink with network file systems
such as NFS and GlusterFS, where the kernel has to take liberties to
avoid network round trips for every file extension, but I don't know
the details on all 9 of our supported OSes and standards aren't very
helpful for understanding buffered I/O.
Let me try to understand your point. Suppose process 1 extends a file to
2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
return 1 based on the comments in the ReadBuffer_common ("because
of buggy Linux kernels that sometimes return an seek SEEK_END result
that doesn't account for a recent write"). b). When this patch is introduced,
we would get 2 blocks for sure if we can get the size from cache. c). After
user reads the 2 blocks from cache and then the cache is evicted, and user
reads the blocks again, it is possible to get 1 again.
Do we need to consider case a) in this patch? and Is case c). the situation you
want to avoid and do we have to introduce fsync to achieve this? Basically I
think case a) should not happen by practice so we don't need to handle case
c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
I'm not opposed to adding them, I am just interested in the background in case
you are also willing to discuss it.
I would suggest the below change for smgrnblocks_fast, FYI
@@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
/*
* The generation doesn't match, the shared relation must have been
* evicted since we got a pointer to it. We'll need to do more work.
+ * and invalidate the fast path for next time.
*/
+ reln->smgr_shared = NULL;
}
--
Best Regards
Andy Fan
Hi Thomas:
I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool entry is not removed because there is no call to 'smgr_drop_sr'. Function 'dropdb' in dbcommands.c removes the buffers from the buffer pool and unlinks the real files by 'rmtree'; it doesn't call smgrdounlinkall, so the smgr shared cache is not dropped although the table has been removed. This will cause some errors when smgr_alloc_sr -> smgropen, smgrimmedsync. The table file has been removed, so smgropen and smgrimmedsync will get an unexpected result.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Tue Dec 22 19:57:35 2020
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Cache relation sizes?
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, it is good to verify VACUUM stuff but I have another question
here. What about queries having functions that access the same
relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
access the relation t1)? Now, here I think because the relation 't1'
is already opened, it might use the same value of blocks from the
shared cache even though the snapshot for relation 't1' when accessed
from func() might be different. Am, I missing something, or is it
dealt in some way?
I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Let me try to understand your point. Suppose process 1 extends a file to
2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
return 1 based on the comments in the ReadBuffer_common ("because
of buggy Linux kernels that sometimes return an seek SEEK_END result
that doesn't account for a recent write"). b). When this patch is introduced,
we would get 2 blocks for sure if we can get the size from cache. c). After
user reads the 2 blocks from cache and then the cache is evicted, and user
reads the blocks again, it is possible to get 1 again.
Do we need to consider case a) in this patch? and Is case c). the situation you
want to avoid and do we have to introduce fsync to achieve this? Basically I
think case a) should not happen by practice so we don't need to handle case
c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
I'm not opposed to adding them, I am just interested in the background in case
you are also willing to discuss it.
Sorry for the lack of background -- there was some discussion on the
thread "Optimize dropping of relation buffers using dlist", but it's
long and mixed up with other topics. I'll try to summarise here.
It's easy to reproduce files shrinking asynchronously on network
filesystems that reserve space lazily[1], and we know that there have
historically been problems even with local filesystems[2], leading to
that comment about buggy kernels. I am only interested in the
behaviour I can demonstrate, because I believe Linux is working as
intended when it does that (hypothetical/unknown bugs would probably
be helped by this approach too, but I'm not interested in speculating
about that beyond these parentheses).
So, why should we consider this? It's true that commit ffae5cc5a60's
change to ReadBuffer_common() already prevents us from eating data by
overwriting an existing buffer due to this phenomenon, but there are
other problems:
1. A system that in this state can give an incorrect answer to a
query: heapam.c calls RelationGetNumberOfBlocks() at the beginning of
a scan, so it might not see recently added blocks containing committed
data. Hopefully we'll panic when we try to create a checkpoint and
see errors, but we could be giving bad results for a while.
2. Commitfest entry #2319 needs an accurate size, or it might leave
stray blocks in the buffer pool that will cause failures or corruption
later.
It's true that we could decide not to worry about this, and perhaps
even acknowledge in some official way that PostgreSQL doesn't work
reliably on some kinds of filesystems. But in this prototype I
thought it was fun to try to fix our very high rate of lseek()
syscalls and this reliability problem, at the same time. I also
speculate that we'll eventually have other reasons to want a
demand-paged per-relation shared memory object.
I would suggest the below change for smgrnblocks_fast, FYI
@@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
/*
* The generation doesn't match, the shared relation must have been
* evicted since we got a pointer to it. We'll need to do more work.
+ * and invalidate the fast path for next time.
*/
+ reln->smgr_shared = NULL;
}
Thanks, makes sense -- I'll add this to the next version.
[1]: /messages/by-id/CA+hUKGLZTfKuXir9U4JQkY=k3Tb6M_3toGrGOK_fa2d4MPQQ_w@mail.gmail.com
[2]: /messages/by-id/BAY116-F146A38326C5630EA33B36BD12B0@phx.gbl
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool entry is not removed because there is no call to 'smgr_drop_sr'. Function 'dropdb' in dbcommands.c removes the buffers from the buffer pool and unlinks the real files by 'rmtree'; it doesn't call smgrdounlinkall, so the smgr shared cache is not dropped although the table has been removed. This will cause some errors when smgr_alloc_sr -> smgropen, smgrimmedsync. The table file has been removed, so smgropen and smgrimmedsync will get an unexpected result.
Hi Buzhen,
Thanks, you're right -- it needs to scan the pool of SRs and forget
everything from the database you're dropping.
On Thu, Dec 24, 2020 at 6:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com>
wrote:
Let me try to understand your point. Suppose process 1 extends a file to
2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
return 1 based on the comments in the ReadBuffer_common ("because
of buggy Linux kernels that sometimes return an seek SEEK_END result
that doesn't account for a recent write"). b). When this patch is introduced,
we would get 2 blocks for sure if we can get the size from cache. c). After
user reads the 2 blocks from cache and then the cache is evicted, and user
reads the blocks again, it is possible to get 1 again.
Do we need to consider case a) in this patch? and Is case c). the situation you
want to avoid and do we have to introduce fsync to achieve this? Basically I
think case a) should not happen by practice so we don't need to handle case
c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
I'm not opposed to adding them, I am just interested in the background in case
you are also willing to discuss it.
Sorry for the lack of background -- there was some discussion on the
thread "Optimize dropping of relation buffers using dlist", but it's
long and mixed up with other topics. I'll try to summarise here.
It's easy to reproduce files shrinking asynchronously on network
filesystems that reserve space lazily[1],
Thanks for the explanation. After I read that thread, I am more confused now.
In [1], we have the below test result.
$ cat magic_shrinking_file.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
int fd;
char buffer[8192] = {0};
fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
if (fd < 0) {
perror("open");
return EXIT_FAILURE;
}
printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
printf("write(...) = %zd\n", write(fd, buffer, sizeof(buffer)));
printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
printf("fsync(...) = %d\n", fsync(fd));
printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
return EXIT_SUCCESS;
}
$ cc magic_shrinking_file.c
$ ./a.out
lseek(..., SEEK_END) = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656
I got two pieces of information from above. a) before the fsync, the lseek(fd, 0, SEEK_END)
returns a correct value, however after the fsync, it returns a wrong value. b).
the fsync failed (return value is -1) in the above case. I'm more confused
because of point a). Since the fsync can't correct some wrong value, what
is the point to call fsync in this patch (I agree that it is good to fix this
reliability problem within this patch, but I'm doubtful if fsync can help in
this case). Am I missing something obviously?
I read your patch with details some days ago and feel it is in pretty good shape,
and I think you are clear about the existing issues very well. like a). smgr_alloc_sr use a
FIFO design. b). SR_PARTITION_LOCK doesn't use a partition lock really. c). and
looks like you have some concern about the concurrency issue, which I didn't find
anything wrong. d). how to handle the DIRTY sr during evict. I planned to recheck
item c) before this reply, but looks like the time is not permitted. May I know what
is your plan about this patch?
[1]: /messages/by-id/CA+hUKGLZTfKuXir9U4JQkY=k3Tb6M_3toGrGOK_fa2d4MPQQ_w@mail.gmail.com
--
Best Regards
Andy Fan
Hi Thomas
I found some other problems, and I want to share my changes with you for confirmation.
<1> I changed the code in smgr_alloc_sr to avoid a deadlock.
LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
flags = smgr_lock_sr(sr);
Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
if (flags & SR_JUST_DIRTIED(forknum))
{
/*
* Someone else dirtied it while we were syncing, so we can't mark
* it clean. Let's give up on this SR and go around again.
*/
smgr_unlock_sr(sr, flags);
LWLockRelease(mapping_lock);
goto retry;
}
/* This fork is clean! */
- flags &= ~SR_SYNCING(forknum);
flags &= ~SR_DIRTY(forknum);
}
In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will retry to get another sr; although it has been synced by smgrimmedsync, the flag SR_SYNCING isn't cleared. This might cause a deadlock. So I changed your patch as above.
<2> I changed the code in smgr_drop_sr to avoid some corner cases
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ sr->nblocks[forknum] = InvalidBlockNumber;
hash_search_with_hash_value(sr_mapping_table,
&reln->smgr_rnode,
hash,
HASH_REMOVE,
NULL);
smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the entry from sr_pool. But in smgrnblocks_fast, it just gets the sr from sr_pool, so I added some code as above to avoid corner cases getting an unexpected result from smgrnblocks_fast. Is it necessary? I also want some advice from you.
Thanks a lot.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Tue Dec 29 22:52:59 2020
To: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Re: Cache relation sizes?
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool entry is not removed because there is no call to 'smgr_drop_sr'. Function 'dropdb' in dbcommands.c removes the buffers from the buffer pool and unlinks the real files by 'rmtree'; it doesn't call smgrdounlinkall, so the smgr shared cache is not dropped although the table has been removed. This will cause some errors when smgr_alloc_sr -> smgropen, smgrimmedsync. The table file has been removed, so smgropen and smgrimmedsync will get an unexpected result.
Hi Buzhen,
Thanks, you're right -- it needs to scan the pool of SRs and forget
everything from the database you're dropping.
Hi,
On 2020-11-19 18:01:14 +1300, Thomas Munro wrote:
From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory.
Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
smgr_shared_relation controls how many can exist at once, and they are
evicted as required.
As noted over IM, I think it'd be good if we could avoid the need for a
new GUC here and instead derive it from other settings. E.g.
max_files_per_process * MaxBackends
or maybe
max_files_per_process * log(MaxBackends)
(on the basis that it's unlikely that each backend uses completely
separate files).
Given the size of the structs it seems that the memory overhead of this
should be acceptable?
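Just to sketch that idea (nothing like this is in the patch; the helper name, the log() dampening and the clamp values are made-up assumptions for illustration):

    /* Hypothetical sketch only: derive the SR pool size from existing settings
     * instead of adding a new GUC.  max_files_per_process and MaxBackends are
     * existing globals; everything else here is invented.  Needs <math.h>. */
    static int
    smgr_derive_shared_relations(void)
    {
        double  guess;

        guess = (double) max_files_per_process * log((double) MaxBackends + 1);
        return Max(64, Min((int) guess, 1000000));
    }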
XXX smgrimmedsync() is maybe too blunt an instrument?
XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...
XXX perhaps bgwriter should try to clean them?
As I suggested on the same call, I think it might make sense to
integrate this with the checkpointer sync queue. Have checkpointer try
to keep a certain fraction of the entries clean (or even make them
unused?), by processing all the sync requests for that
relfilenode. That'd allow us to a) only sync segments we actually need to
sync b) syncing those segments again at the end of the checkpoint.
Another use-case I have is that I'd like to make file-extension better
than it currently is. There's two main challenges:
- Having bulk extension add pages to the FSM isn't actually a good
solution, as that has a high likelihood of multiple backends ending up
trying to insert into the same page. If we had, in SMgrSharedRelation,
a 'physical' relation size and a 'logical' relation size, we could do
better, and hand out pre-allocated pages to exactly one backend (a rough sketch follows this list).
- Doing bulk-extension when there's no other space left leads to large
pile-ups of processes needing extension lock -> a large latency
increase. Instead we should opportunistically start to do extension
when we've used-up most of the bulk allocated space.
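For illustration, the 'physical'/'logical' split could look something like this in SMgrSharedRelation (purely a sketch of the idea, not something in the posted patches; the field names are invented):

    struct SMgrSharedRelation
    {
        RelFileNodeBackend rnode;
        /* blocks already handed out to backends for insertion ("logical" size) */
        BlockNumber logical_nblocks[MAX_FORKNUM + 1];
        /* blocks actually allocated in the file on disk ("physical" size) */
        BlockNumber physical_nblocks[MAX_FORKNUM + 1];
        pg_atomic_uint32 flags;
    };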
+/*
+ * An entry in the hash table that allows us to look up objects in the
+ * SMgrSharedRelation pool by rnode (+ backend).
+ */
+typedef struct SMgrSharedRelationMapping
+{
+ RelFileNodeBackend rnode;
+ int index;
+} SMgrSharedRelationMapping;
Maybe add a note that index points into
SMgrSharedRelationPool->objects[]? And why you chose to not store
SMgrSharedRelation directly in the hashtable (makes sense to me, still
worth commenting on imo).
+/* Flags. */
+#define SR_LOCKED 0x01
+#define SR_VALID 0x02
+
+/* Each forknum gets its own dirty, syncing and just dirtied bits. */
+#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
+#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
+#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
Hm - do we really want all that stored in a single flag value?
+/*
+ * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
+ * only a few instructions. The return value is the current set of flags,
+ * which may be modified and then passed to smgr_unlock_sr() to be atomically
+ * when the lock is released.
+ */
+static uint32
+smgr_lock_sr(SMgrSharedRelation *sr)
+{
+
+ for (;;)
+ {
+ uint32 old_flags = pg_atomic_read_u32(&sr->flags);
+ uint32 flags;
+
+ if (!(old_flags & SR_LOCKED))
+ {
+ flags = old_flags | SR_LOCKED;
+ if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
+ return flags;
+ }
+ }
+ return 0; /* unreachable */
+}
Since this is a spinlock, you should use the spin delay infrastructure
(c.f. LWLockWaitListLock()). Otherwise it'll react terribly if there
ever ends up being contention.
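E.g. roughly following the pattern of LWLockWaitListLock(), using the spin-delay API from s_lock.h (sketch only, untested):

    static uint32
    smgr_lock_sr(SMgrSharedRelation *sr)
    {
        SpinDelayStatus delayStatus;

        init_local_spin_delay(&delayStatus);
        for (;;)
        {
            uint32  old_flags = pg_atomic_read_u32(&sr->flags);

            if (!(old_flags & SR_LOCKED))
            {
                uint32  flags = old_flags | SR_LOCKED;

                if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
                {
                    finish_spin_delay(&delayStatus);
                    return flags;
                }
            }
            perform_spin_delay(&delayStatus);
        }
    }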
What's the reason not to use a per-entry lwlock instead?
Naming: I find it weird to have smgr_ as a prefix and _sr as a
postfix. I'd go for smgr_sr_$op.
+/*
+ * Unlock a SMgrSharedRelation, atomically updating its flags at the same
+ * time.
+ */
+static void
+smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
+{
+ pg_write_barrier();
+ pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
+}
This assumes there never are atomic modifications of flags without
holding the spinlock (or a cmpxchg loop testing whether somebody is
holding the lock).
Is that right / good?
+/*
+ * Allocate a new invalid SMgrSharedRelation, and return it locked.
+ *
+ * The replacement algorithm is a simple FIFO design with no second chance for
+ * now.
+ */
+static SMgrSharedRelation *
+smgr_alloc_sr(void)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 index;
+ LWLock *mapping_lock;
+ uint32 flags;
+ RelFileNodeBackend rnode;
+ uint32 hash;
+
+ retry:
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
Modulo is pretty expensive - I'd force smgr_shared_relations to be a
power-of-two and then do a & smgr_shared_relations_mask initialized to
(smgr_shared_relations - 1).
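Something like this, assuming smgr_shmem_init() verifies that the GUC is a power of two and initializes the mask (neither the mask variable nor the helper exists in the patch):

    static uint32 smgr_shared_relations_mask;   /* smgr_shared_relations - 1 */

    static inline uint32
    smgr_next_sr_index(void)
    {
        /* With a power-of-two pool size, the wrap-around is a cheap AND. */
        return pg_atomic_fetch_add_u32(&sr_pool->next, 1) & smgr_shared_relations_mask;
    }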
+ /* If it's unused, can return it, still locked, immediately. */
+ if (!(flags & SR_VALID))
+ return sr;
+
+ /*
+ * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
+ * locks, but we need to do it in that order, so we'll unlock the SR
+ * first.
+ */
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ hash = get_hash_value(sr_mapping_table, &rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (!mapping || mapping->index != index)
+ {
+ /* Too slow, it's gone or now points somewhere else. Go around. */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* We will lock the SR for just a few instructions. */
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
So the replacement here is purely based on when the relation was
accessed first, not when it was accessed last. Probably good enough, but
seems worth a comment.
3. The nblocks value must be free enough for scans, extension,
truncation and dropping buffers, because those operations all
executed memory barriers when acquiring a snapshot to scan (which
doesn't need to see blocks added after that) or an exclusive heavyweight
lock to extend, truncate or drop. XXX right?
"free enough"?
I am not sure this holds for non-mvcc scans?
@@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
- if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
return true;
ISTM that all these places referencing _fast is leaking an
implementation detail (but not very far) - seems like that should be an
implementation detail for smgrnblocks() which can then call out to
_fast/_shared or such.
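I.e. something like a single internal helper that hides both lookups, so callers such as smgrexists() never mention _fast or _shared (just a sketch; the helper name is invented):

    /* Returns InvalidBlockNumber if neither cache level has an answer. */
    static BlockNumber
    smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
    {
        BlockNumber result;

        result = smgrnblocks_fast(reln, forknum);
        if (result == InvalidBlockNumber)
            result = smgrnblocks_shared(reln, forknum);
        return result;
    }

    bool
    smgrexists(SMgrRelation reln, ForkNumber forknum)
    {
        if (smgrnblocks_cached(reln, forknum) != InvalidBlockNumber)
            return true;

        return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
    }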
Greetings,
Andres Freund
On Mon, Dec 28, 2020 at 5:24 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
lseek(..., SEEK_END) = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656
I got two pieces of information from above. a) before the fsync, the lseek(fd, 0, SEEK_END)
returns a correct value, however after the fsync, it returns a wrong value. b).
the fsync failed (return value is -1) in the above case. I'm more confused
because of point a). Since the fsync can't correct some wrong value, what
is the point to call fsync in this patch (I agree that it is good to fix this
reliability problem within this patch, but I'm doubtful if fsync can help in
this case). Am I missing something obviously?
The point of the fsync() call isn't to correct the value, it's to find
out if it is safe to drop the SR object from shared memory because the
operating system has promised that it won't forget about the new size.
If that fails, we'll report an ERROR or PANIC (depending on
data_sync_retry), but not drop the SR.
By the way, you don't need fsync(fd) for the size to change, as shown
by the other experiment in that other thread that just did sleep(60).
It might also happen asynchronously. fsync(fd) forces it, but also
tells you about errors.
I read your patch with details some days ago and feel it is in pretty good shape.
and I think you are clear about the existing issues very well. like a). smgr_alloc_sr use a
FIFO design. b). SR_PARTITION_LOCK doesn't use a partition lock really. c). and
Yeah, it probably should have its own bank of locks (I wish they were
easier to add, via lwlocknames.txt or similar).
looks like you have some concern about the concurrency issue, which I didn't find
anything wrong. d). how to handle the DIRTY sr during evict. I planned to recheck
So yeah, mostly this discussion has been about not trusting lseek()
and using our own tracking of relation size instead. But there is a
separate question about how "fresh" the value needs to be. The
question is: under what circumstances could the unlocked read of
nblocks from shared memory give you a value that isn't fresh enough
for your snapshot/scan? This involves thinking about implicit memory
barriers.
The idea of RECENTLY_DIRTIED is similar to what we do for buffers:
while you're trying to "clean" it (in this case by calling fsync())
someone else can extend the relation again, and in that case we don't
know whether the new extension was included in the fsync() or not, so
we can't clear the DIRTY flag when it completes.
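In code terms the completion rule is roughly this (a sketch of the rule only; the function name is invented, the flag names are the patch's):

    static void
    smgr_finish_sync_sketch(SMgrSharedRelation *sr, ForkNumber forknum)
    {
        uint32  flags = smgr_lock_sr(sr);

        flags &= ~SR_SYNCING(forknum);
        if (!(flags & SR_JUST_DIRTIED(forknum)))
            flags &= ~SR_DIRTY(forknum);    /* the fsync() covered everything */
        smgr_unlock_sr(sr, flags);
    }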
item c) before this reply, but looks like the time is not permitted. May I know what
Is your plan about this patch?
Aside from details, bugs etc discussed in this thread, one major piece
remains incomplete: something in the background to "clean" SRs so that
foreground processes rarely have to block.
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
I found some other problems which I want to share my change with you to make you confirm.
<1> I changed the code in smgr_alloc_sr to avoid a deadlock.
LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
flags = smgr_lock_sr(sr);
Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
if (flags & SR_JUST_DIRTIED(forknum))
{
/*
* Someone else dirtied it while we were syncing, so we can't mark
* it clean. Let's give up on this SR and go around again.
*/
smgr_unlock_sr(sr, flags);
LWLockRelease(mapping_lock);
goto retry;
}
/* This fork is clean! */
- flags &= ~SR_SYNCING(forknum);
flags &= ~SR_DIRTY(forknum);
}
In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will retry to get another sr; although it has been synced by smgrimmedsync, the flag SR_SYNCING isn't cleared. This might cause a deadlock. So I changed your patch as above.
Right. Thanks!
I also added smgrdropdb() to handle DROP DATABASE (the problem you
reported in your previous email).
While tweaking that code, I fixed it so that it uses a condition
variable to wait (instead of the silly sleep loop) when it needs to
drop an SR that is being sync'd. Also, it now releases the mapping
lock while it's doing that, and reacquires it on retry.
<2> I changed the code in smgr_drop_sr to avoid some corner cases
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ sr->nblocks[forknum] = InvalidBlockNumber;
hash_search_with_hash_value(sr_mapping_table,
&reln->smgr_rnode,
hash,
HASH_REMOVE,
NULL);
smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the entry from sr_pool. But in smgrnblocks_fast, it just gets the sr from sr_pool, so I added some code as above to avoid corner cases getting an unexpected result from smgrnblocks_fast. Is it necessary? I also want some advice from you.
Hmm. I think it might be better to increment sr->generation. That
was already done in the "eviction" code path, where other processes
might still have references to the SR object, and I don't think it's
possible for anyone to access a dropped relation, but I suppose it's
more consistent to do that anyway. Fixed.
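I.e. roughly this in the drop path, mirroring what the eviction path already does (just a sketch here; the helper name is made up):

    static void
    smgr_invalidate_sr(SMgrSharedRelation *sr)
    {
        uint32  flags = smgr_lock_sr(sr);

        /* Caller has already waited out any in-progress sync. */
        Assert(!(flags & SR_SYNCING_MASK));

        /* Make any cached smgr_shared pointers fail their generation check. */
        pg_atomic_write_u64(&sr->generation,
                            pg_atomic_read_u64(&sr->generation) + 1);
        pg_write_barrier();

        /* Unlock with no flags set: the slot is no longer valid. */
        smgr_unlock_sr(sr, 0);
    }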
Thanks for the review!
Attachments:
v3-0001-WIP-Track-relation-sizes-in-shared-memory.patch (text/x-patch)
From 1d49405b41b3cdd6baf6f29219f18c6adefb5547 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH v3 1/2] WIP: Track relation sizes in shared memory.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
smgr_shared_relation controls how many can exist at once, and they are
evicted as required. "Dirty" SMgrSharedRelations can only be evicted
after being synced to disk. Goals:
1. Provide faster lookups of relation sizes, cutting down on lseek()
calls. This supersedes the recovery-only caching added recently, and
replaces preexisting FSM and VM caching schemes.
2. Stop trusting the operating system to keep track of the size of
files that we have recently extended, until fsync() has been called.
XXX smgrimmedsync() is maybe too blunt an instrument?
XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...
XXX perhaps bgwriter should try to clean them?
XXX currently reusing the array of locks also used for buffer mapping,
need to define some more in lwlocks.c...
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
Reviewed-by: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
---
contrib/pg_visibility/pg_visibility.c | 1 -
src/backend/access/heap/visibilitymap.c | 28 +-
src/backend/catalog/storage.c | 2 -
src/backend/commands/dbcommands.c | 3 +
src/backend/postmaster/pgstat.c | 3 +
src/backend/storage/freespace/freespace.c | 35 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/smgr/md.c | 10 +
src/backend/storage/smgr/smgr.c | 540 ++++++++++++++++++++--
src/backend/utils/misc/guc.c | 11 +
src/include/pgstat.h | 1 +
src/include/storage/smgr.h | 18 +-
12 files changed, 566 insertions(+), 89 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 54e47b810f..702951e487 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bc..8fe29ecae7 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -547,6 +547,7 @@ static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
Buffer buf;
+ BlockNumber nblocks;
/*
* We might not have opened the relation at the smgr level yet, or we
@@ -557,20 +558,12 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
*/
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
- }
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ nblocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -636,17 +629,10 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ /* Create the file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..45f62030aa 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -290,8 +290,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..e046334bdb 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -970,6 +970,9 @@ dropdb(const char *dbname, bool missing_ok, bool force)
*/
DropDatabaseBuffers(db_id);
+ /* Drop SMGR relations. */
+ smgrdropdb(db_id);
+
/*
* Tell the stats collector to forget it immediately, too.
*/
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 123369f4fa..2905bfb472 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4039,6 +4039,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SAFE_SNAPSHOT:
event_name = "SafeSnapshot";
break;
+ case WAIT_EVENT_SMGR_DROP_SYNC:
+ event_name = "SmgrDropSync";
+ break;
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..629923c4d4 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -531,29 +531,18 @@ static Buffer
fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
{
BlockNumber blkno = fsm_logical_to_physical(addr);
+ BlockNumber nblocks;
Buffer buf;
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the FSM yet, check it first. Also
- * recheck if the requested block seems to be past end, since our cached
- * value might be stale. (We send smgr inval messages on truncation, but
- * not on extension.)
- */
- if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
- {
- /* Invalidate the cache so smgrnblocks asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
- }
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ nblocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ else
+ nblocks = 0;
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -621,18 +610,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ /* Create the FSM file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..cfc55d3691 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -121,6 +121,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size());
+ size = add_size(size, smgr_shmem_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
size = add_size(size, PredicateLockShmemSize());
@@ -212,6 +213,8 @@ CreateSharedMemoryAndSemaphores(void)
dsm_shmem_init();
+ smgr_shmem_init();
+
/*
* Set up xlog, clog, and buffers
*/
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9889ad6ad8..7310a12b6a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1359,6 +1359,16 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}
+ /*
+ * XXX: We could have an interface smgrbeginclean() that would return true
+ * if it has managed to set SR_SYNC and clean SR_JUST_DIRTIED, and then if
+ * so, after our sync runs we could call smgrfinishclean() with our
+ * success/failure report, which would clear SR_DIRTY if SR_JUST_DIRTIED
+ * hasn't been set in the meantime. But... how can we know if *this*
+ * segment is one that represents an extension? SR_DIRTY is just
+ * interested in syncing extended blocks.
+ */
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 072bdd118f..1da2dfc250 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,13 +19,68 @@
#include "access/xlog.h"
#include "lib/ilist.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/shmem.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+/*
+ * An entry in the hash table that allows us to look up objects in the
+ * SMgrSharedRelation pool by rnode (+ backend).
+ */
+typedef struct SMgrSharedRelationMapping
+{
+ RelFileNodeBackend rnode;
+ int index;
+} SMgrSharedRelationMapping;
+
+/*
+ * An object in shared memory tracks the size of the forks of a relation.
+ */
+struct SMgrSharedRelation
+{
+ RelFileNodeBackend rnode;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ pg_atomic_uint32 flags;
+};
+
+/* For now, we borrow the buffer managers array of locks. XXX fixme */
+#define SR_PARTITIONS NUM_BUFFER_PARTITIONS
+#define SR_PARTITION_LOCK(hash) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET].lock)
+
+/* Flags. */
+#define SR_LOCKED 0x01
+#define SR_VALID 0x02
+
+/* Each forknum gets its own dirty, syncing and just dirtied bits. */
+#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
+#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
+#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
+
+/* Masks to test if any forknum is currently dirty or syncing. */
+#define SR_SYNCING_MASK (((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1)))
+#define SR_DIRTY_MASK (((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1)))
+
+/* Extract the lowest dirty forknum from flags (there must be at least one). */
+#define SR_GET_ONE_DIRTY(mask) pg_rightmost_one_pos32((((mask) >> 2) & (SR_DIRTY_MASK >> 2)))
+
+typedef struct SMgrSharedRelationPool
+{
+ ConditionVariable sync_flags_cleared;
+ pg_atomic_uint32 next;
+ SMgrSharedRelation objects[FLEXIBLE_ARRAY_MEMBER];
+} SMgrSharedRelationPool;
+
+static SMgrSharedRelationPool *sr_pool;
+static HTAB *sr_mapping_table;
/*
* This struct of function pointers defines the API between smgr.c and
@@ -98,6 +153,417 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/* GUCs. */
+int smgr_shared_relations = 1000;
+
+/*
+ * Try to get the size of a relation's fork by looking it up in the mapping
+ * table with a shared lock. This will succeed if the SMgrRelation already
+ * exists.
+ */
+static BlockNumber
+smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ BlockNumber result = InvalidBlockNumber;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ result = sr->nblocks[forknum];
+ }
+ LWLockRelease(mapping_lock);
+
+ return result;
+}
+
+/*
+ * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
+ * only a few instructions. The return value is the current set of flags,
+ * which may be modified and then passed to smgr_unlock_sr() to be applied
+ * atomically when the lock is released.
+ */
+static uint32
+smgr_lock_sr(SMgrSharedRelation *sr)
+{
+
+ for (;;)
+ {
+ uint32 old_flags = pg_atomic_read_u32(&sr->flags);
+ uint32 flags;
+
+ if (!(old_flags & SR_LOCKED))
+ {
+ flags = old_flags | SR_LOCKED;
+ if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
+ return flags;
+ }
+ }
+ return 0; /* unreachable */
+}
+
+/*
+ * Unlock a SMgrSharedRelation, atomically updating its flags at the same
+ * time.
+ */
+static void
+smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
+{
+ pg_write_barrier();
+ pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
+}
+
+/*
+ * Allocate a new invalid SMgrSharedRelation, and return it locked.
+ *
+ * The replacement algorithm is a simple FIFO design with no second chance for
+ * now.
+ */
+static SMgrSharedRelation *
+smgr_alloc_sr(void)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 index;
+ LWLock *mapping_lock;
+ uint32 flags;
+ RelFileNodeBackend rnode;
+ uint32 hash;
+
+ retry:
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+
+ /* If it's unused, can return it, still locked, immediately. */
+ if (!(flags & SR_VALID))
+ return sr;
+
+ /*
+ * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
+ * locks, but we need to do it in that order, so we'll unlock the SR
+ * first.
+ */
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ hash = get_hash_value(sr_mapping_table, &rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (!mapping || mapping->index != index)
+ {
+ /* Too slow, it's gone or now points somewhere else. Go around. */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* We will lock the SR for just a few instructions. */
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ /*
+ * If another backend is currently syncing any fork, we aren't allowed to
+ * evict it, and waiting for it would be pointless because that other
+ * backend already plans to allocate it. So go around.
+ */
+ if (flags & SR_SYNCING_MASK)
+ {
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /*
+ * We will sync every fork that is dirty, and then we'll try to
+ * evict it.
+ */
+ while (flags & SR_DIRTY_MASK)
+ {
+ SMgrRelation reln;
+ ForkNumber forknum = SR_GET_ONE_DIRTY(flags);
+
+ /* Set the sync bit, clear the just-dirtied bit and unlock. */
+ flags |= SR_SYNCING(forknum);
+ flags &= ~SR_JUST_DIRTIED(forknum);
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * Perform the I/O, with no locks held.
+ * XXX It sucks that we fsync every segment, not just the ones that need it...
+ */
+ reln = smgropen(rnode.node, rnode.backend);
+ smgrimmedsync(reln, forknum);
+
+ /*
+ * Reacquire the locks. The object can't have been evicted,
+ * because we set a sync bit.
+ */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
+ if (flags & SR_JUST_DIRTIED(forknum))
+ {
+ /*
+ * Someone else dirtied it while we were syncing, so we can't mark
+ * it clean. Let's give up on this SR and go around again.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* This fork is clean! */
+ flags &= ~SR_DIRTY(forknum);
+ }
+
+ /*
+ * If we made it this far, there are no dirty forks, so we're now allowed
+ * to evict the SR from the pool and the mapping table.
+ */
+ flags &= ~SR_VALID;
+ smgr_unlock_sr(sr, flags);
+
+ /*
+ * If any callers of smgr_drop_sr() or smgrdropdb() had the misfortune
+ * to have to wait for us to finish syncing, we can now wake them up.
+ */
+ ConditionVariableBroadcast(&sr_pool->sync_flags_cleared);
+
+ /* Remove from the mapping table. */
+ hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * XXX: We unlock while doing HASH_REMOVE on principle. Maybe it'd be OK
+ * to hold it now that the clock hand is far away and there is no way
+ * anyone can look up this SR through buffer mapping table.
+ */
+ flags = smgr_lock_sr(sr);
+ if (flags & SR_VALID)
+ {
+ /* Oops, someone else got it. */
+ smgr_unlock_sr(sr, flags);
+ goto retry;
+ }
+
+ return sr;
+}
+
+/*
+ * Set the number of blocks in a relation, in shared memory, and optionally
+ * also mark the relation as "dirty" (meaning that it must be fsync'd before it
+ * can be evicted).
+ */
+static void
+smgrnblocks_update(SMgrRelation reln,
+ ForkNumber forknum,
+ BlockNumber nblocks,
+ bool mark_dirty)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr = NULL;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ retry:
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ flags = smgr_lock_sr(sr);
+ if (mark_dirty)
+ {
+ /*
+ * Extend and truncate clobber the value, and there are no races
+ * to worry about because they can have higher level exclusive
+ * locking on the relation.
+ */
+ sr->nblocks[forknum] = nblocks;
+
+ /*
+ * Mark it dirty, and if it's currently being sync'd, make sure it
+ * stays dirty after that completes.
+ */
+ flags |= SR_DIRTY(forknum);
+ if (flags & SR_SYNCING(forknum))
+ flags |= SR_JUST_DIRTIED(forknum);
+ }
+ else if (!(flags & SR_DIRTY(forknum)))
+ {
+ /*
+ * We won't clobber a dirty value with a non-dirty update, to
+ * avoid races against concurrent extend/truncate, but we can
+ * install a new clean value.
+ */
+ sr->nblocks[forknum] = nblocks;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+ LWLockRelease(mapping_lock);
+
+ /* If we didn't find it, then we'll need to allocate one. */
+ if (!sr)
+ {
+ bool found;
+
+ sr = smgr_alloc_sr();
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_ENTER,
+ &found);
+ if (!found)
+ {
+ /* Success! Initialize. */
+ mapping->index = sr - sr_pool->objects;
+ smgr_unlock_sr(sr, SR_VALID);
+ sr->rnode = reln->smgr_rnode;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ sr->nblocks[i] = InvalidBlockNumber;
+ LWLockRelease(mapping_lock);
+ }
+ else
+ {
+ /* Someone beat us to it. Go around again. */
+ smgr_unlock_sr(sr, 0); /* = not valid */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+ }
+}
+
+static void
+smgr_drop_sr(RelFileNodeBackend *rnode)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+retry:
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ if (flags & SR_SYNCING_MASK)
+ {
+ /*
+ * Oops, someone's syncing one of its forks; nothing to do but
+ * wait until that's finished.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ ConditionVariableSleep(&sr_pool->sync_flags_cleared,
+ WAIT_EVENT_SMGR_DROP_SYNC);
+ goto retry;
+ }
+ ConditionVariableCancelSleep();
+
+ /* Mark it invalid and drop the mapping. */
+ smgr_unlock_sr(sr, ~SR_VALID);
+ hash_search_with_hash_value(sr_mapping_table,
+ rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ }
+ LWLockRelease(mapping_lock);
+}
+
+size_t
+smgr_shmem_size(void)
+{
+ size_t size = 0;
+
+ size = add_size(size,
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations);
+ size = add_size(size,
+ hash_estimate_size(smgr_shared_relations,
+ sizeof(SMgrSharedRelationMapping)));
+
+ return size;
+}
+
+void
+smgr_shmem_init(void)
+{
+ HASHCTL info;
+ bool found;
+
+ info.keysize = sizeof(RelFileNodeBackend);
+ info.entrysize = sizeof(SMgrSharedRelationMapping);
+ info.num_partitions = SR_PARTITIONS;
+ sr_mapping_table = ShmemInitHash("SMgrSharedRelation Mapping Table",
+ smgr_shared_relations,
+ smgr_shared_relations,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+
+ sr_pool = ShmemInitStruct("SMgrSharedRelation Pool",
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations,
+ &found);
+ if (!found)
+ {
+ ConditionVariableInit(&sr_pool->sync_flags_cleared);
+ pg_atomic_init_u32(&sr_pool->next, 0);
+ for (uint32 i = 0; i < smgr_shared_relations; ++i)
+ {
+ pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ }
+ }
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -174,8 +640,6 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -246,6 +710,9 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
+ if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ return true;
+
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
}
@@ -429,6 +896,9 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
for (i = 0; i < nrels; i++)
CacheInvalidateSmgr(rnodes[i]);
+ for (i = 0; i < nrels; i++)
+ smgr_drop_sr(&rels[i]->smgr_rnode);
+
/*
* Delete the physical file(s).
*
@@ -464,16 +934,7 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
-
- /*
- * Normally we expect this to increase nblocks by one, but if the cached
- * value isn't as expected, just invalidate it so the next call asks the
- * kernel.
- */
- if (reln->smgr_cached_nblocks[forknum] == blocknum)
- reln->smgr_cached_nblocks[forknum] = blocknum + 1;
- else
- reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ smgrnblocks_update(reln, forknum, blocknum + 1, true);
}
/*
@@ -549,16 +1010,16 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
- /*
- * For now, we only use cached values in recovery due to lack of a shared
- * invalidation mechanism for changes in file size.
- */
- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
- return reln->smgr_cached_nblocks[forknum];
+ /* Can we get the answer from shared memory with only a share lock? */
+ result = smgrnblocks_shared(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+ /* Ask the kernel. */
result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
- reln->smgr_cached_nblocks[forknum] = result;
+ /* Update the value in shared memory for faster service next time. */
+ smgrnblocks_update(reln, forknum, result, false);
return result;
}
@@ -599,19 +1060,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
- /* Make the cached size is invalid if we encounter an error. */
- reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
-
- /*
- * We might as well update the local smgr_cached_nblocks values. The
- * smgr cache inval message that this function sent will cause other
- * backends to invalidate their copies of smgr_fsm_nblocks and
- * smgr_vm_nblocks, and these ones too at the next command boundary.
- * But these ensure they aren't outright wrong until then.
- */
- reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+ smgrnblocks_update(reln, forknum[i], nblocks[i], true);
}
}
@@ -644,6 +1094,34 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * When a database is dropped, we have to find and throw away all its
+ * SMgrSharedRelation objects.
+ */
+void
+smgrdropdb(Oid database)
+{
+ for (int i = 0; i < smgr_shared_relations; ++i)
+ {
+ SMgrSharedRelation *sr = &sr_pool->objects[i];
+ RelFileNodeBackend rnode;
+ uint32 flags;
+
+ /* Hold the spinlock only while we copy out the rnode of matches. */
+ flags = smgr_lock_sr(sr);
+ if ((flags & SR_VALID) && sr->rnode.node.dbNode == database)
+ {
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ /* Drop, if it's still valid. */
+ smgr_drop_sr(&rnode);
+ }
+ else
+ smgr_unlock_sr(sr, flags);
+ }
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 878fcc2236..b548b4fd96 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
#include "storage/proc.h"
+#include "storage/smgr.h"
#include "storage/standby.h"
#include "tcop/tcopprot.h"
#include "tsearch/ts_cache.h"
@@ -2435,6 +2436,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_shared_relations", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared relation objects in memory at one time."),
+ NULL
+ },
+ &smgr_shared_relations,
+ 1000, 64, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..d13cc20279 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -963,6 +963,7 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
+ WAIT_EVENT_SMGR_DROP_SYNC,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..7031c70196 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,13 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/* GUCs. */
+extern int smgr_shared_relations;
+
+/* Definition private to smgr.c. */
+struct SMgrSharedRelation;
+typedef struct SMgrSharedRelation SMgrSharedRelation;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -44,14 +51,7 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
- /*
- * The following fields are reset to InvalidBlockNumber upon a cache flush
- * event, and hold the last known size for each fork. This information is
- * currently only reliable during recovery, since there is no cache
- * invalidation for fork extension.
- */
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
@@ -77,6 +77,9 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern size_t smgr_shmem_size(void);
+extern void smgr_shmem_init(void);
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -102,6 +105,7 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrdropdb(Oid database);
extern void AtEOXact_SMgr(void);
#endif /* SMGR_H */
--
2.20.1
v3-0002-WIP-Provide-a-lock-free-fast-path-for-smgrnblocks.patch
From 2c3c98711bdadd0846c4dfff3aea6cdf84f6df44 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 19 Nov 2020 17:09:51 +1300
Subject: [PATCH v3 2/2] WIP: Provide a lock-free fast path for smgrnblocks().
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
SMgrRelation objects gain a pointer to the last known SMgrSharedRelation
object. There are three concurrency hazards to worry about:
1. The SMgrSharedRelation pointed to could be evicted at any time. We
record a generation number, and insert memory barriers so that we can
detect that and fall back to a slower path.
2. The nblocks value is read without any locking, which is atomic
because it is a 32 bit value, and PostgreSQL requires atomic 32 bit
reads generally.
3. The nblocks value must be fresh enough for scans, extension,
truncation and dropping buffers, because those operations all executed
memory barriers when they acquired a snapshot to scan (which doesn't
need to see blocks added after that) or an exclusive heavyweight lock
to extend, truncate or drop. XXX right?
XXX That's the idea anyway, but this is just a sketch; almost certainly
incomplet and inkorrect.
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
Reviewed-by: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
---
src/backend/storage/smgr/smgr.c | 63 +++++++++++++++++++++++++++++++--
src/include/storage/smgr.h | 4 +++
2 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 1da2dfc250..5c17e3e74d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,7 @@ struct SMgrSharedRelation
RelFileNodeBackend rnode;
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
+ pg_atomic_uint64 generation; /* mapping change */
};
/* For now, we borrow the buffer managers array of locks. XXX fixme */
@@ -156,6 +157,40 @@ static void smgrshutdown(int code, Datum arg);
/* GUCs. */
int smgr_shared_relations = 1000;
+/*
+ * Try to get the size of a relation's fork without locking.
+ */
+static BlockNumber
+smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelation *sr = reln->smgr_shared;
+ BlockNumber result;
+
+ if (sr)
+ {
+ pg_read_barrier();
+
+ /* We can load int-sized values atomically without special measures. */
+ Assert(sizeof(sr->nblocks[forknum]) == sizeof(uint32));
+ result = sr->nblocks[forknum];
+
+ /*
+ * With a read barrier between the loads, we can check that the object
+ * still refers to the same rnode before trusting the answer.
+ */
+ pg_read_barrier();
+ if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ return result;
+
+ /*
+ * The generation doesn't match, the shared relation must have been
+ * evicted since we got a pointer to it. We'll need to do more work.
+ */
+ }
+
+ return InvalidBlockNumber;
+}
+
/*
* Try to get the size of a relation's fork by looking it up in the mapping
* table with a shared lock. This will succeed if the SMgrRelation already
@@ -183,6 +218,10 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
{
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+
+ /* We can take the fast path until this SR is eventually evicted. */
+ reln->smgr_shared = sr;
+ reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
}
LWLockRelease(mapping_lock);
@@ -339,9 +378,14 @@ smgr_alloc_sr(void)
/*
* If we made it this far, there are no dirty forks, so we're now allowed
- * to evict the SR from the pool and the mapping table.
+ * to evict the SR from the pool and the mapping table. Make sure that
+ * smgrnblocks_fast() sees that its pointer is now invalid by bumping the
+ * generation.
*/
flags &= ~SR_VALID;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
smgr_unlock_sr(sr, flags);
/*
@@ -455,6 +499,8 @@ smgrnblocks_update(SMgrRelation reln,
mapping->index = sr - sr_pool->objects;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
for (int i = 0; i <= MAX_FORKNUM; ++i)
sr->nblocks[i] = InvalidBlockNumber;
LWLockRelease(mapping_lock);
@@ -509,6 +555,11 @@ retry:
}
ConditionVariableCancelSleep();
+ /* Make sure smgrnblocks_fast() knows it's invalidated. */
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
+
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
hash_search_with_hash_value(sr_mapping_table,
@@ -561,6 +612,7 @@ smgr_shmem_init(void)
for (uint32 i = 0; i < smgr_shared_relations; ++i)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
}
}
}
@@ -640,6 +692,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
+ reln->smgr_shared = NULL;
+ reln->smgr_shared_generation = 0;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -710,7 +764,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
- if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
return true;
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
@@ -1010,6 +1064,11 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
+ /* Can we get the answer from shared memory without locking? */
+ result = smgrnblocks_fast(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+
/* Can we get the answer from shared memory with only a share lock? */
result = smgrnblocks_shared(reln, forknum);
if (result != InvalidBlockNumber)
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 7031c70196..fec13cb7bc 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -51,6 +51,10 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
+ /* pointer to shared object, valid if non-NULL and generation matches */
+ SMgrSharedRelation *smgr_shared;
+ uint64 smgr_shared_generation;
+
BlockNumber smgr_targblock; /* current insertion target block */
/* additional public fields may someday exist here */
--
2.20.1
On Wed, Dec 30, 2020 at 5:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
and requires it on retry
s/requires/reacquires/
Hi, Thomas
Are there other scenarios that need the same treatment as dropdb for the smgr cache?
1. Drop tablespace on the master: should an smgrdroptbs function be added to DropTableSpace? (See the sketch below.)
2. Drop database on a slave: smgrdropdb in dbase_redo.
3. Drop tablespace on a slave: smgrdroptbs in tblspc_redo.
Buzhen
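For illustration, here is a rough sketch of what an smgrdroptbs() for case 1 could look like, modelled on the smgrdropdb() in the v3 patch above and living alongside it in smgr.c; the function name and the idea of matching on rnode.node.spcNode are assumptions, not part of any posted patch:

/*
 * Hypothetical sketch: when a tablespace is dropped, find and throw away
 * all of its SMgrSharedRelation objects, by analogy with smgrdropdb().
 */
void
smgrdroptbs(Oid tablespace)
{
	for (int i = 0; i < smgr_shared_relations; ++i)
	{
		SMgrSharedRelation *sr = &sr_pool->objects[i];
		RelFileNodeBackend rnode;
		uint32		flags;

		/* Hold the spinlock only while we copy out the rnode of matches. */
		flags = smgr_lock_sr(sr);
		if ((flags & SR_VALID) && sr->rnode.node.spcNode == tablespace)
		{
			rnode = sr->rnode;
			smgr_unlock_sr(sr, flags);

			/* Drop, if it's still valid. */
			smgr_drop_sr(&rnode);
		}
		else
			smgr_unlock_sr(sr, flags);
	}
}

Call sites would presumably be needed in DropTableSpace() and, for the redo paths, in tblspc_redo() and dbase_redo(), mirroring the existing smgrdropdb() call in dropdb(), to cover cases 2 and 3 as well.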
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Fri Jan 8 00:56:17 2021
To: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Cache relation sizes?
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
I found some other problems; I want to share my changes with you so you can confirm them.
<1> I changed the code in smgr_alloc_sr to avoid a deadlock:
LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
flags = smgr_lock_sr(sr);
Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
if (flags & SR_JUST_DIRTIED(forknum))
{
/*
* Someone else dirtied it while we were syncing, so we can't mark
* it clean. Let's give up on this SR and go around again.
*/
smgr_unlock_sr(sr, flags);
LWLockRelease(mapping_lock);
goto retry;
}
/* This fork is clean! */
- flags &= ~SR_SYNCING(forknum);
flags &= ~SR_DIRTY(forknum);
}
In smgr_drop_sr, if the sr is SR_SYNCING, we retry until it is no longer SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, we retry with another sr even though this one has already been synced by smgrimmedsync, so its SR_SYNCING flag is never cleared. That could cause a deadlock, so I changed your patch as above.
Right. Thanks!
I also added smgrdropdb() to handle DROP DATABASE (the problem you
reported in your previous email).
While tweaking that code, I fixed it so that it uses a condition
variable to wait (instead of the silly sleep loop) when it needs to
drop an SR that is being sync'd. Also, it now releases the mapping
lock while it's doing that, and requires it on retry.
<2> I changed the code in smgr_drop_sr to avoid some corner cases:
 /* Mark it invalid and drop the mapping. */
 smgr_unlock_sr(sr, ~SR_VALID);
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ sr->nblocks[forknum] = InvalidBlockNumber;
 hash_search_with_hash_value(sr_mapping_table,
 &reln->smgr_rnode,
 hash,
 HASH_REMOVE,
 NULL);
smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't reset the entry in sr_pool. smgrnblocks_fast reads the sr straight from sr_pool, so I added the code above to avoid corner cases where smgrnblocks_fast gets an unexpected result. Is this necessary? I would also like your advice on it.
Hmm. I think it might be better to increment sr->generation. That
was already done in the "eviction" code path, where other processes
might still have references to the SR object, and I don't think it's
possible for anyone to access a dropped relation, but I suppose it's
more consistent to do that anyway. Fixed.
Thanks for the review!
Hi Thomas
I want to share a patch with you, I change the replacement algorithm from fifo to a simple lru.
Buzhen
Attachments:
0001-update-fifo-to-lru-to-sweep-a-valid-cache.patch
From 5fa10ceb1cb2cf005f2f9dffecd38dec2e6017c0 Mon Sep 17 00:00:00 2001
From: "buzhen.cjx" <buzhen.cjx@alibaba-inc.com>
Date: Mon, 18 Jan 2021 21:47:21 +0800
Subject: [PATCH] update fifo to lru to sweep a valid cache
---
src/backend/storage/smgr/smgr.c | 57 +++++++++++++++++++++++++++------
src/backend/utils/misc/guc.c | 10 ++++++
src/include/storage/smgr.h | 2 ++
3 files changed, 60 insertions(+), 9 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index e4f92c3d182..9a9df7f711c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -138,7 +138,8 @@ static void smgrshutdown(int code, Datum arg);
/* GUCs. */
-int smgr_shared_relations = 1000;
+int smgr_shared_relations = 10000;
+int smgr_pool_sweep_times = 32;
/*
* Try to get the size of a relation's fork without locking.
@@ -162,8 +163,13 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
* still refers to the same rnode before trusting the answer.
*/
pg_read_barrier();
+
if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ {
+ /* no need to use an atomic operation; usecount can be imprecise */
+ sr->usecount++;
return result;
+ }
/*
* The generation doesn't match, the shared relation must have been
@@ -203,6 +209,9 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+ /* no need to use an atomic operation; usecount can be imprecise */
+ sr->usecount++;
+
/* We can take the fast path until this SR is eventually evicted. */
reln->smgr_shared = sr;
reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
@@ -248,6 +257,36 @@ smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
}
+/* LRU: sweep to find an sr to use. The sr is returned locked. */
+static SMgrSharedRelation *
+smgr_pool_sweep(void)
+{
+ SMgrSharedRelation *sr;
+ uint32 index;
+ uint32 flags;
+ int sr_used_count = 0;
+
+ for (;;)
+ {
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+ if (--(sr->usecount) <= 0)
+ {
+ elog(DEBUG5, "found a reusable smgr cache entry during sweep");
+ return sr;
+ }
+ if (++sr_used_count >= smgr_shared_relations * smgr_pool_sweep_times)
+ {
+ elog(LOG, "all smgr cache entries are in frequent use, evicting a random one");
+ smgr_unlock_sr(sr, flags);
+ sr = &sr_pool->objects[random() % smgr_shared_relations];
+ flags = smgr_lock_sr(sr);
+ return sr;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+}
+
/*
* Allocate a new invalid SMgrSharedRelation, and return it locked.
*
@@ -268,11 +307,8 @@ smgr_alloc_sr(void)
uint32 hash;
retry:
- /* Lock the next one in clock-hand order. */
- index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
- sr = &sr_pool->objects[index];
- flags = smgr_lock_sr(sr);
-
+ sr = smgr_pool_sweep();
+ flags = pg_atomic_read_u32(&sr->flags);
/* If it's unused, can return it, still locked, immediately. */
if (!(flags & SR_VALID))
return sr;
@@ -282,6 +318,7 @@ smgr_alloc_sr(void)
* locks, but we need to do it in that order, so we'll unlock the SR
* first.
*/
+ index = sr - sr_pool->objects;
rnode = sr->rnode;
smgr_unlock_sr(sr, flags);
@@ -475,6 +512,8 @@ smgrnblocks_update(SMgrRelation reln,
*/
sr->nblocks[forknum] = nblocks;
}
+ if (sr->usecount < smgr_pool_sweep_times)
+ sr->usecount++;
smgr_unlock_sr(sr, flags);
}
LWLockRelease(mapping_lock);
@@ -497,6 +536,7 @@ smgrnblocks_update(SMgrRelation reln,
{
/* Success! Initialize. */
mapping->index = sr - sr_pool->objects;
+ sr->usecount = 1;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
pg_atomic_write_u64(&sr->generation,
@@ -573,6 +613,7 @@ retry:
sr->nblocks[forknum] = InvalidBlockNumber;
/* Mark it invalid and drop the mapping. */
+ sr->usecount = 0;
smgr_unlock_sr(sr, ~SR_VALID);
hash_search_with_hash_value(sr_mapping_table,
rnode,
@@ -626,6 +667,7 @@ smgr_shmem_init(void)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
+ sr_pool->objects[i].usecount = 0;
}
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b8a679de10f..d3e9699f552 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4545,6 +4545,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_pool_sweep_times", PGC_SIGHUP, UNGROUPED,
+ gettext_noop("Sets the times for smgr pool to clock sweep"),
+ NULL
+ },
+ &smgr_pool_sweep_times,
+ 32, 16, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index fd99695520c..4feefdf824b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -25,6 +25,7 @@
/* GUCs. */
extern int smgr_shared_relations;
+extern int smgr_pool_sweep_times;
/*
* An object in shared memory tracks the size of the forks of a relation.
@@ -35,6 +36,7 @@ struct SMgrSharedRelation
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
pg_atomic_uint64 generation; /* mapping change */
+ int64 usecount; /* used for clock sweep */
};
/* Definition private to smgr.c. */
--
2.23.0
Hi Thomas,
On 1/18/21 10:42 PM, 陈佳昕(步真) wrote:
I want to share a patch with you, I change the replacement algorithm
from fifo to a simple lru.
What do you think of this change?
Also, your patch set from [1] no longer applies (and of course this
latest patch is confusing the tester as well).
Marking Waiting on Author.
Regards,
--
-David
david@pgmasters.net
[1]: /messages/by-id/CA+hUKGLQU=txW6maxZJoU3v-a42i=_UN2YLeB+aW6JuKcYrUBA@mail.gmail.com
On Thu, Mar 4, 2021 at 2:39 AM David Steele <david@pgmasters.net> wrote:
On 1/18/21 10:42 PM, 陈佳昕(步真) wrote:
I want to share a patch with you, I change the replacement algorithm
from fifo to a simple lru.
What do you think of this change?
Ok, if I'm reading this right, it changes the replacement algorithm
from "clock" to "gclock". Gclock is like PostgreSQL's buffer pool,
except that here the usage count is limited with a GUC, instead of the
universal constant 5. Certainly gclock is better than clock. I used
the simplest algorithm I could think of, deliberately trying not to go
down the cache replacement algorithm rabbit hole. I think we should
match the buffer pool, however good/bad that may be, and adding a
usage count does seem to do that.
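To make the terminology concrete, here is a minimal, self-contained sketch of the gclock idea being discussed; it is illustrative only and not taken from either patch, and the names and the USAGE_LIMIT constant are placeholders standing in for the buffer pool's constant 5 or the smgr_pool_sweep_times GUC:

/*
 * Generalized clock ("gclock") eviction: each access bumps a capped usage
 * count; the sweeping hand decrements counts and evicts the first slot
 * whose count has already reached zero.  Plain clock is the special case
 * where the cap is 1 (a single reference bit).
 */
typedef struct Slot
{
	int			usecount;
	/* ... cached payload would go here ... */
} Slot;

#define POOL_SIZE	1000
#define USAGE_LIMIT	5

static Slot pool[POOL_SIZE];
static int	hand = 0;

static void
slot_accessed(Slot *slot)
{
	if (slot->usecount < USAGE_LIMIT)
		slot->usecount++;
}

static Slot *
gclock_evict(void)
{
	for (;;)
	{
		Slot	   *victim = &pool[hand];

		hand = (hand + 1) % POOL_SIZE;
		if (victim->usecount == 0)
			return victim;		/* not referenced recently enough: evict */
		victim->usecount--;		/* weaken it and keep sweeping */
	}
}

Locking is omitted here; the patch above protects the sweep with the per-SR spinlock and tolerates imprecise, unlocked usecount increments from readers.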
I had some trouble applying the patch; it seems to be corrupted, and
to apply on top of some other patch (it looks like a struct moved from a
.c to a .h?), but I managed to figure it out and make the tests pass,
and I've attached a rebase.
Also, your patch set from [1] no longer applies (and of course this
latest patch is confusing the tester as well).
Rebased. The main change was commit bea449c6 "Optimize
DropRelFileNodesAllBuffers() for recovery.". With this rebase, that
optimisation now works online, which is cool (by "online" I mean
not just in recovery).
Other changes in this version:
* Added spinlock backoff recommended by Andres in his review earlier.
Erm, that's it. There are still many review comments from Buzhen and
Andres to address. The main thing being: how can we clean these
objects in the background? That question being unanswered, there is a
0% chance of committing this in PostgreSQL 14. So I will now move
this to the next CF.
Thanks for the reviews so far!
Attachments:
v4-0001-WIP-Track-relation-sizes-in-shared-memory.patch
From 735925fb72a8417e7940fca352877042266ae06b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH v4 1/3] WIP: Track relation sizes in shared memory.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Introduce a fixed-size pool of SMgrSharedRelation objects. A new GUC
smgr_shared_relations controls how many can exist at once, and they are
evicted as required. "Dirty" SMgrSharedRelations can only be evicted
after being synced to disk. Goals:
1. Provide faster lookups of relation sizes, cutting down on lseek()
calls. This supersedes the recovery-only caching added recently, and
replaces preexisting FSM and VM caching schemes.
2. Stop trusting the operating system to keep track of the size of
files that we have recently extended, until fsync() has been called.
XXX smgrimmedsync() is maybe too blunt an instrument?
XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...
XXX perhaps bgwriter should try to clean them?
XXX currently reusing the array of locks also used for buffer mapping,
need to define some more in lwlocks.c...
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
Reviewed-by: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
---
contrib/pg_visibility/pg_visibility.c | 1 -
src/backend/access/heap/visibilitymap.c | 28 +-
src/backend/catalog/storage.c | 2 -
src/backend/commands/dbcommands.c | 3 +
src/backend/postmaster/pgstat.c | 3 +
src/backend/storage/buffer/bufmgr.c | 44 +-
src/backend/storage/freespace/freespace.c | 35 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/smgr/md.c | 10 +
src/backend/storage/smgr/smgr.c | 559 ++++++++++++++++++++--
src/backend/utils/misc/guc.c | 11 +
src/include/pgstat.h | 1 +
src/include/storage/smgr.h | 19 +-
13 files changed, 579 insertions(+), 140 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..ca00c3d384 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d8..59bd0e924a 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -547,6 +547,7 @@ static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
Buffer buf;
+ BlockNumber nblocks;
/*
* We might not have opened the relation at the smgr level yet, or we
@@ -557,20 +558,12 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
*/
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
- }
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ nblocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -636,17 +629,10 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ /* Create the file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..5796567cd9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -290,8 +290,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..cf4019f92e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -970,6 +970,9 @@ dropdb(const char *dbname, bool missing_ok, bool force)
*/
DropDatabaseBuffers(db_id);
+ /* Drop SMGR relations. */
+ smgrdropdb(db_id);
+
/*
* Tell the stats collector to forget it immediately, too.
*/
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 68eefb9722..951af691ba 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4121,6 +4121,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SAFE_SNAPSHOT:
event_name = "SafeSnapshot";
break;
+ case WAIT_EVENT_SMGR_DROP_SYNC:
+ event_name = "SmgrDropSync";
+ break;
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4c1d5eceb4..9e0aaea84c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3011,30 +3011,11 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
* that we don't leave any buffer for the relation being dropped as
* otherwise the background writer or checkpointer can lead to a PANIC
* error while flushing buffers corresponding to files that don't exist.
- *
- * To know the exact size, we rely on the size cached for each fork by us
- * during recovery which limits the optimization to recovery and on
- * standbys but we can easily extend it once we have shared cache for
- * relation size.
- *
- * In recovery, we cache the value returned by the first lseek(SEEK_END)
- * and the future writes keeps the cached value up-to-date. See
- * smgrextend. It is possible that the value of the first lseek is smaller
- * than the actual number of existing blocks in the file due to buggy
- * Linux kernels that might not have accounted for the recent write. But
- * that should be fine because there must not be any buffers after that
- * file size.
*/
for (i = 0; i < nforks; i++)
{
/* Get the number of blocks for a relation's fork */
- nForkBlock[i] = smgrnblocks_cached(smgr_reln, forkNum[i]);
-
- if (nForkBlock[i] == InvalidBlockNumber)
- {
- nBlocksToInvalidate = InvalidBlockNumber;
- break;
- }
+ nForkBlock[i] = smgrnblocks(smgr_reln, forkNum[i]);
/* calculate the number of blocks to be invalidated */
nBlocksToInvalidate += (nForkBlock[i] - firstDelBlock[i]);
@@ -3044,8 +3025,7 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
* We apply the optimization iff the total number of blocks to invalidate
* is below the BUF_DROP_FULL_SCAN_THRESHOLD.
*/
- if (BlockNumberIsValid(nBlocksToInvalidate) &&
- nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+ if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
for (j = 0; j < nforks; j++)
FindAndDropRelFileNodeBuffers(rnode.node, forkNum[j],
@@ -3113,7 +3093,6 @@ DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
BlockNumber (*block)[MAX_FORKNUM + 1];
BlockNumber nBlocksToInvalidate = 0;
RelFileNode *nodes;
- bool cached = true;
bool use_bsearch;
if (nnodes == 0)
@@ -3154,22 +3133,19 @@ DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
* We can avoid scanning the entire buffer pool if we know the exact size
* of each of the given relation forks. See DropRelFileNodeBuffers.
*/
- for (i = 0; i < n && cached; i++)
+ for (i = 0; i < n; i++)
{
for (j = 0; j <= MAX_FORKNUM; j++)
{
- /* Get the number of blocks for a relation's fork. */
- block[i][j] = smgrnblocks_cached(rels[i], j);
-
- /* We need to only consider the relation forks that exists. */
- if (block[i][j] == InvalidBlockNumber)
+ if (!smgrexists(rels[i], j))
{
- if (!smgrexists(rels[i], j))
- continue;
- cached = false;
- break;
+ block[i][j] = InvalidBlockNumber;
+ continue;
}
+ /* Get the number of blocks for a relation's fork. */
+ block[i][j] = smgrnblocks(rels[i], j);
+
/* calculate the total number of blocks to be invalidated */
nBlocksToInvalidate += block[i][j];
}
@@ -3179,7 +3155,7 @@ DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
* We apply the optimization iff the total number of blocks to invalidate
* is below the BUF_DROP_FULL_SCAN_THRESHOLD.
*/
- if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+ if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
for (i = 0; i < n; i++)
{
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda238..05930ca846 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -531,29 +531,18 @@ static Buffer
fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
{
BlockNumber blkno = fsm_logical_to_physical(addr);
+ BlockNumber nblocks;
Buffer buf;
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the FSM yet, check it first. Also
- * recheck if the requested block seems to be past end, since our cached
- * value might be stale. (We send smgr inval messages on truncation, but
- * not on extension.)
- */
- if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
- {
- /* Invalidate the cache so smgrnblocks asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
- }
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ nblocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ else
+ nblocks = 0;
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -621,18 +610,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ /* Create the FSM file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..2cecb8acd7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -121,6 +121,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size());
+ size = add_size(size, smgr_shmem_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
size = add_size(size, PredicateLockShmemSize());
@@ -212,6 +213,8 @@ CreateSharedMemoryAndSemaphores(void)
dsm_shmem_init();
+ smgr_shmem_init();
+
/*
* Set up xlog, clog, and buffers
*/
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..b78edd399f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1359,6 +1359,16 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}
+ /*
+ * XXX: We could have an interface smgrbeginclean() that would return true
+ * if it has managed to set SR_SYNC and clean SR_JUST_DIRTIED, and then if
+ * so, after our sync runs we could call smgrfinishclean() with our
+ * success/failure report, which would clear SR_DIRTY if SR_JUST_DIRTIED
+ * hasn't been set in the meantime. But... how can we know if *this*
+ * segment is one that represents an extension? SR_DIRTY is just
+ * interested in syncing extended blocks.
+ */
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..80f942e1f6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,13 +19,68 @@
#include "access/xlog.h"
#include "lib/ilist.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/shmem.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+/*
+ * An entry in the hash table that allows us to look up objects in the
+ * SMgrSharedRelation pool by rnode (+ backend).
+ */
+typedef struct SMgrSharedRelationMapping
+{
+ RelFileNodeBackend rnode;
+ int index;
+} SMgrSharedRelationMapping;
+
+/*
+ * An object in shared memory tracks the size of the forks of a relation.
+ */
+struct SMgrSharedRelation
+{
+ RelFileNodeBackend rnode;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ pg_atomic_uint32 flags;
+};
+
+/* For now, we borrow the buffer manager's array of locks. XXX fixme */
+#define SR_PARTITIONS NUM_BUFFER_PARTITIONS
+#define SR_PARTITION_LOCK(hash) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + ((hash) % SR_PARTITIONS)].lock)
+
+/* Flags. */
+#define SR_LOCKED 0x01
+#define SR_VALID 0x02
+
+/* Each forknum gets its own dirty, syncing and just dirtied bits. */
+#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
+#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
+#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
+
+/* Masks to test if any forknum is currently dirty or syncing. */
+#define SR_SYNCING_MASK (((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1)))
+#define SR_DIRTY_MASK (((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1)))
+
+/* Extract the lowest dirty forknum from flags (there must be at least one). */
+#define SR_GET_ONE_DIRTY(mask) pg_rightmost_one_pos32((((mask) >> 2) & (SR_DIRTY_MASK >> 2)))
+
+typedef struct SMgrSharedRelationPool
+{
+ ConditionVariable sync_flags_cleared;
+ pg_atomic_uint32 next;
+ SMgrSharedRelation objects[FLEXIBLE_ARRAY_MEMBER];
+} SMgrSharedRelationPool;
+
+static SMgrSharedRelationPool *sr_pool;
+static HTAB *sr_mapping_table;
/*
* This struct of function pointers defines the API between smgr.c and
@@ -98,6 +153,422 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/* GUCs. */
+int smgr_shared_relations = 1000;
+
+/*
+ * Try to get the size of a relation's fork by looking it up in the mapping
+ * table with a shared lock. This will succeed if the SMgrRelation already
+ * exists.
+ */
+static BlockNumber
+smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ BlockNumber result = InvalidBlockNumber;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ result = sr->nblocks[forknum];
+ }
+ LWLockRelease(mapping_lock);
+
+ return result;
+}
+
+/*
+ * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
+ * only a few instructions. The return value is the current set of flags,
+ * which may be modified and then passed to smgr_unlock_sr() to be atomically
+ * when the lock is released.
+ */
+static uint32
+smgr_lock_sr(SMgrSharedRelation *sr)
+{
+ SpinDelayStatus spin_delay_status;
+
+ init_local_spin_delay(&spin_delay_status);
+
+ for (;;)
+ {
+ uint32 old_flags = pg_atomic_read_u32(&sr->flags);
+ uint32 flags;
+
+ if (!(old_flags & SR_LOCKED))
+ {
+ flags = old_flags | SR_LOCKED;
+ if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
+ return flags;
+ }
+
+ perform_spin_delay(&spin_delay_status);
+ }
+ return 0; /* unreachable */
+}
+
+/*
+ * Unlock a SMgrSharedRelation, atomically updating its flags at the same
+ * time.
+ */
+static void
+smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
+{
+ pg_write_barrier();
+ pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
+}
+
+/*
+ * Allocate a new invalid SMgrSharedRelation, and return it locked.
+ *
+ * The replacement algorithm is a simple FIFO design with no second chance for
+ * now.
+ */
+static SMgrSharedRelation *
+smgr_alloc_sr(void)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 index;
+ LWLock *mapping_lock;
+ uint32 flags;
+ RelFileNodeBackend rnode;
+ uint32 hash;
+
+ retry:
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+
+ /* If it's unused, can return it, still locked, immediately. */
+ if (!(flags & SR_VALID))
+ return sr;
+
+ /*
+ * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
+ * locks, but we need to do it in that order, so we'll unlock the SR
+ * first.
+ */
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ hash = get_hash_value(sr_mapping_table, &rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (!mapping || mapping->index != index)
+ {
+ /* Too slow, it's gone or now points somewhere else. Go around. */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* We will lock the SR for just a few instructions. */
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ /*
+ * If another backend is currently syncing any fork, we aren't allowed to
+ * evict it, and waiting for it would be pointless because that other
+ * backend already plans to allocate it. So go around.
+ */
+ if (flags & SR_SYNCING_MASK)
+ {
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /*
+ * We will sync every fork that is dirty, and then we'll try to
+ * evict it.
+ */
+ while (flags & SR_DIRTY_MASK)
+ {
+ SMgrRelation reln;
+ ForkNumber forknum = SR_GET_ONE_DIRTY(flags);
+
+ /* Set the sync bit, clear the just-dirtied bit and unlock. */
+ flags |= SR_SYNCING(forknum);
+ flags &= ~SR_JUST_DIRTIED(forknum);
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * Perform the I/O, with no locks held.
+ * XXX It sucks that we fsync every segment, not just the ones that need it...
+ */
+ reln = smgropen(rnode.node, rnode.backend);
+ smgrimmedsync(reln, forknum);
+
+ /*
+ * Reacquire the locks. The object can't have been evicted,
+ * because we set a sync bit.
+ */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
+ if (flags & SR_JUST_DIRTIED(forknum))
+ {
+ /*
+ * Someone else dirtied it while we were syncing, so we can't mark
+ * it clean. Let's give up on this SR and go around again.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* This fork is clean! */
+ flags &= ~SR_DIRTY(forknum);
+ }
+
+ /*
+ * If we made it this far, there are no dirty forks, so we're now allowed
+ * to evict the SR from the pool and the mapping table.
+ */
+ flags &= ~SR_VALID;
+ smgr_unlock_sr(sr, flags);
+
+ /*
+ * If any callers of smgr_drop_sr() or smgrdropdb() had the misfortune
+ * to have to wait for us to finish syncing, we can now wake them up.
+ */
+ ConditionVariableBroadcast(&sr_pool->sync_flags_cleared);
+
+ /* Remove from the mapping table. */
+ hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * XXX: We unlock while doing HASH_REMOVE on principle. Maybe it'd be OK
+ * to hold it now that the clock hand is far away and there is no way
+ * anyone can look up this SR through buffer mapping table.
+ */
+ flags = smgr_lock_sr(sr);
+ if (flags & SR_VALID)
+ {
+ /* Oops, someone else got it. */
+ smgr_unlock_sr(sr, flags);
+ goto retry;
+ }
+
+ return sr;
+}
+
+/*
+ * Set the number of blocks in a relation, in shared memory, and optionally
+ * also mark the relation as "dirty" (meaning that it must be fsync'd before it
+ * can be evicted).
+ */
+static void
+smgrnblocks_update(SMgrRelation reln,
+ ForkNumber forknum,
+ BlockNumber nblocks,
+ bool mark_dirty)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr = NULL;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ retry:
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ flags = smgr_lock_sr(sr);
+ if (mark_dirty)
+ {
+ /*
+ * Extend and truncate clobber the value, and there are no races
+ * to worry about because they can have higher level exclusive
+ * locking on the relation.
+ */
+ sr->nblocks[forknum] = nblocks;
+
+ /*
+ * Mark it dirty, and if it's currently being sync'd, make sure it
+ * stays dirty after that completes.
+ */
+ flags |= SR_DIRTY(forknum);
+ if (flags & SR_SYNCING(forknum))
+ flags |= SR_JUST_DIRTIED(forknum);
+ }
+ else if (!(flags & SR_DIRTY(forknum)))
+ {
+ /*
+ * We won't clobber a dirty value with a non-dirty update, to
+ * avoid races against concurrent extend/truncate, but we can
+ * install a new clean value.
+ */
+ sr->nblocks[forknum] = nblocks;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+ LWLockRelease(mapping_lock);
+
+ /* If we didn't find it, then we'll need to allocate one. */
+ if (!sr)
+ {
+ bool found;
+
+ sr = smgr_alloc_sr();
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_ENTER,
+ &found);
+ if (!found)
+ {
+ /* Success! Initialize. */
+ mapping->index = sr - sr_pool->objects;
+ smgr_unlock_sr(sr, SR_VALID);
+ sr->rnode = reln->smgr_rnode;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ sr->nblocks[i] = InvalidBlockNumber;
+ LWLockRelease(mapping_lock);
+ }
+ else
+ {
+ /* Someone beat us to it. Go around again. */
+ smgr_unlock_sr(sr, 0); /* = not valid */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+ }
+}
+
+static void
+smgr_drop_sr(RelFileNodeBackend *rnode)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+retry:
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ if (flags & SR_SYNCING_MASK)
+ {
+ /*
+ * Oops, someone's syncing one of its forks; nothing to do but
+ * wait until that's finished.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ ConditionVariableSleep(&sr_pool->sync_flags_cleared,
+ WAIT_EVENT_SMGR_DROP_SYNC);
+ goto retry;
+ }
+ ConditionVariableCancelSleep();
+
+ /* Mark it invalid and drop the mapping. */
+ smgr_unlock_sr(sr, ~SR_VALID);
+ hash_search_with_hash_value(sr_mapping_table,
+ rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ }
+ LWLockRelease(mapping_lock);
+}
+
+size_t
+smgr_shmem_size(void)
+{
+ size_t size = 0;
+
+ size = add_size(size,
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations);
+ size = add_size(size,
+ hash_estimate_size(smgr_shared_relations,
+ sizeof(SMgrSharedRelationMapping)));
+
+ return size;
+}
+
+void
+smgr_shmem_init(void)
+{
+ HASHCTL info;
+ bool found;
+
+ info.keysize = sizeof(RelFileNodeBackend);
+ info.entrysize = sizeof(SMgrSharedRelationMapping);
+ info.num_partitions = SR_PARTITIONS;
+ sr_mapping_table = ShmemInitHash("SMgrSharedRelation Mapping Table",
+ smgr_shared_relations,
+ smgr_shared_relations,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+
+ sr_pool = ShmemInitStruct("SMgrSharedRelation Pool",
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations,
+ &found);
+ if (!found)
+ {
+ ConditionVariableInit(&sr_pool->sync_flags_cleared);
+ pg_atomic_init_u32(&sr_pool->next, 0);
+ for (uint32 i = 0; i < smgr_shared_relations; ++i)
+ {
+ pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ }
+ }
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -174,8 +645,6 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -246,6 +715,9 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
+ if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ return true;
+
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
}
@@ -429,6 +901,9 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
for (i = 0; i < nrels; i++)
CacheInvalidateSmgr(rnodes[i]);
+ for (i = 0; i < nrels; i++)
+ smgr_drop_sr(&rels[i]->smgr_rnode);
+
/*
* Delete the physical file(s).
*
@@ -464,16 +939,7 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
-
- /*
- * Normally we expect this to increase nblocks by one, but if the cached
- * value isn't as expected, just invalidate it so the next call asks the
- * kernel.
- */
- if (reln->smgr_cached_nblocks[forknum] == blocknum)
- reln->smgr_cached_nblocks[forknum] = blocknum + 1;
- else
- reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ smgrnblocks_update(reln, forknum, blocknum + 1, true);
}
/*
@@ -549,38 +1015,20 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
- /* Check and return if we get the cached value for the number of blocks. */
- result = smgrnblocks_cached(reln, forknum);
+ /* Can we get the answer from shared memory with only a share lock? */
+ result = smgrnblocks_shared(reln, forknum);
if (result != InvalidBlockNumber)
return result;
+ /* Ask the kernel. */
result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
- reln->smgr_cached_nblocks[forknum] = result;
+ /* Update the value in shared memory for faster service next time. */
+ smgrnblocks_update(reln, forknum, result, false);
return result;
}
-/*
- * smgrnblocks_cached() -- Get the cached number of blocks in the supplied
- * relation.
- *
- * Returns an InvalidBlockNumber when not in recovery and when the relation
- * fork size is not cached.
- */
-BlockNumber
-smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
-{
- /*
- * For now, we only use cached values in recovery due to lack of a shared
- * invalidation mechanism for changes in file size.
- */
- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
- return reln->smgr_cached_nblocks[forknum];
-
- return InvalidBlockNumber;
-}
-
/*
* smgrtruncate() -- Truncate the given forks of supplied relation to
* each specified numbers of blocks
@@ -617,19 +1065,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
- /* Make the cached size is invalid if we encounter an error. */
- reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
-
- /*
- * We might as well update the local smgr_cached_nblocks values. The
- * smgr cache inval message that this function sent will cause other
- * backends to invalidate their copies of smgr_fsm_nblocks and
- * smgr_vm_nblocks, and these ones too at the next command boundary.
- * But these ensure they aren't outright wrong until then.
- */
- reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+ smgrnblocks_update(reln, forknum[i], nblocks[i], true);
}
}
@@ -662,6 +1099,34 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * When a database is dropped, we have to find and throw away all its
+ * SMgrSharedRelation objects.
+ */
+void
+smgrdropdb(Oid database)
+{
+ for (int i = 0; i < smgr_shared_relations; ++i)
+ {
+ SMgrSharedRelation *sr = &sr_pool->objects[i];
+ RelFileNodeBackend rnode;
+ uint32 flags;
+
+ /* Hold the spinlock only while we copy out the rnode of matches. */
+ flags = smgr_lock_sr(sr);
+ if ((flags & SR_VALID) && sr->rnode.node.dbNode == database)
+ {
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ /* Drop, if it's still valid. */
+ smgr_drop_sr(&rnode);
+ }
+ else
+ smgr_unlock_sr(sr, flags);
+ }
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..011a9aad1f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
#include "storage/proc.h"
+#include "storage/smgr.h"
#include "storage/standby.h"
#include "tcop/tcopprot.h"
#include "tsearch/ts_cache.h"
@@ -2467,6 +2468,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_shared_relations", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared relation objects in memory at one time."),
+ NULL
+ },
+ &smgr_shared_relations,
+ 1000, 64, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f9166b8655..4059dfd83e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1008,6 +1008,7 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
+ WAIT_EVENT_SMGR_DROP_SYNC,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..a6b4c84362 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,13 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/* GUCs. */
+extern int smgr_shared_relations;
+
+/* Definition private to smgr.c. */
+struct SMgrSharedRelation;
+typedef struct SMgrSharedRelation SMgrSharedRelation;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -44,14 +51,7 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
- /*
- * The following fields are reset to InvalidBlockNumber upon a cache flush
- * event, and hold the last known size for each fork. This information is
- * currently only reliable during recovery, since there is no cache
- * invalidation for fork extension.
- */
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
@@ -77,6 +77,9 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern size_t smgr_shmem_size(void);
+extern void smgr_shmem_init(void);
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -99,10 +102,10 @@ extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
-extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrdropdb(Oid database);
extern void AtEOXact_SMgr(void);
#endif /* SMGR_H */
--
2.30.1
v4-0002-WIP-Provide-a-lock-free-fast-path-for-smgrnblocks.patch
From dbfdbace1b880c57d12c949e1f2bf3805e3f5184 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 19 Nov 2020 17:09:51 +1300
Subject: [PATCH v4 2/3] WIP: Provide a lock-free fast path for smgrnblocks().
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
SMgrRelation objects gain a pointer to the last known SMgrSharedRelation
object. There are three concurrency hazards to worry about:
1. The SMgrSharedRelation pointed to could be evicted at any time. We
record a generation number, and insert memory barriers so that we can
detect that and fall back to a slower path.
2. The nblocks value is read without any locking, which is atomic
because it is a 32 bit value, and PostgreSQL requires atomic 32 bit
reads generally.
3. The nblocks value must be fresh enough for scans, extension,
truncation and dropping buffers, because those operations all
execute memory barriers when acquiring a snapshot to scan (which
doesn't need to see blocks added after that) or an exclusive heavyweight
lock to extend, truncate or drop. XXX right?
XXX That's the idea anyway, but this is just a sketch; almost certainly
incomplet and inkorrect.
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
Reviewed-by: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
---
src/backend/storage/smgr/smgr.c | 63 +++++++++++++++++++++++++++++++--
src/include/storage/smgr.h | 4 +++
2 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 80f942e1f6..ca877dc4af 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,7 @@ struct SMgrSharedRelation
RelFileNodeBackend rnode;
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
+ pg_atomic_uint64 generation; /* mapping change */
};
/* For now, we borrow the buffer managers array of locks. XXX fixme */
@@ -156,6 +157,40 @@ static void smgrshutdown(int code, Datum arg);
/* GUCs. */
int smgr_shared_relations = 1000;
+/*
+ * Try to get the size of a relation's fork without locking.
+ */
+static BlockNumber
+smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelation *sr = reln->smgr_shared;
+ BlockNumber result;
+
+ if (sr)
+ {
+ pg_read_barrier();
+
+ /* We can load int-sized values atomically without special measures. */
+ Assert(sizeof(sr->nblocks[forknum]) == sizeof(uint32));
+ result = sr->nblocks[forknum];
+
+ /*
+ * With a read barrier between the loads, we can check that the object
+ * still refers to the same rnode before trusting the answer.
+ */
+ pg_read_barrier();
+ if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ return result;
+
+ /*
+ * The generation doesn't match, the shared relation must have been
+ * evicted since we got a pointer to it. We'll need to do more work.
+ */
+ }
+
+ return InvalidBlockNumber;
+}
+
/*
* Try to get the size of a relation's fork by looking it up in the mapping
* table with a shared lock. This will succeed if the SMgrRelation already
@@ -183,6 +218,10 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
{
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+
+ /* We can take the fast path until this SR is eventually evicted. */
+ reln->smgr_shared = sr;
+ reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
}
LWLockRelease(mapping_lock);
@@ -344,9 +383,14 @@ smgr_alloc_sr(void)
/*
* If we made it this far, there are no dirty forks, so we're now allowed
- * to evict the SR from the pool and the mapping table.
+ * to evict the SR from the pool and the mapping table. Make sure that
+ * smgrnblocks_fast() sees that its pointer is now invalid by bumping the
+ * generation.
*/
flags &= ~SR_VALID;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
smgr_unlock_sr(sr, flags);
/*
@@ -460,6 +504,8 @@ smgrnblocks_update(SMgrRelation reln,
mapping->index = sr - sr_pool->objects;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
for (int i = 0; i <= MAX_FORKNUM; ++i)
sr->nblocks[i] = InvalidBlockNumber;
LWLockRelease(mapping_lock);
@@ -514,6 +560,11 @@ retry:
}
ConditionVariableCancelSleep();
+ /* Make sure smgrnblocks_fast() knows it's invalidated. */
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
+
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
hash_search_with_hash_value(sr_mapping_table,
@@ -566,6 +617,7 @@ smgr_shmem_init(void)
for (uint32 i = 0; i < smgr_shared_relations; ++i)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
}
}
}
@@ -645,6 +697,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
+ reln->smgr_shared = NULL;
+ reln->smgr_shared_generation = 0;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -715,7 +769,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
- if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
return true;
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
@@ -1015,6 +1069,11 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
+ /* Can we get the answer from shared memory without locking? */
+ result = smgrnblocks_fast(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+
/* Can we get the answer from shared memory with only a share lock? */
result = smgrnblocks_shared(reln, forknum);
if (result != InvalidBlockNumber)
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6b4c84362..09f17bff92 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -51,6 +51,10 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
+ /* pointer to shared object, valid if non-NULL and generation matches */
+ SMgrSharedRelation *smgr_shared;
+ uint64 smgr_shared_generation;
+
BlockNumber smgr_targblock; /* current insertion target block */
/* additional public fields may someday exist here */
--
2.30.1
v4-0003-update-fifo-to-lru-to-sweep-a-valid-cache.patch
From ba904aa8d32524049195f548d431d030546f4c15 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 11 Mar 2021 21:09:40 +1300
Subject: [PATCH v4 3/3] update fifo to lru to sweep a valid cache
Author: <buzhen.cjx@alibaba-inc.com>
---
src/backend/storage/smgr/smgr.c | 55 +++++++++++++++++++++++++++++----
src/backend/utils/misc/guc.c | 10 ++++++
src/include/storage/smgr.h | 1 +
3 files changed, 60 insertions(+), 6 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ca877dc4af..d2cf20432c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -51,6 +51,7 @@ struct SMgrSharedRelation
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
pg_atomic_uint64 generation; /* mapping change */
+ int64 usecount; /* used for clock sweep */
};
/* For now, we borrow the buffer managers array of locks. XXX fixme */
@@ -155,7 +156,8 @@ static dlist_head unowned_relns;
static void smgrshutdown(int code, Datum arg);
/* GUCs. */
-int smgr_shared_relations = 1000;
+int smgr_shared_relations = 10000;
+int smgr_pool_sweep_times = 32;
/*
* Try to get the size of a relation's fork without locking.
@@ -179,8 +181,13 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
* still refers to the same rnode before trusting the answer.
*/
pg_read_barrier();
+
if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ {
+ /* no need to use an atomic operation; usecount can be imprecise */
+ sr->usecount++;
return result;
+ }
/*
* The generation doesn't match, the shared relation must have been
@@ -219,6 +226,9 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+ /* no need to use an atomic operation; usecount can be imprecise */
+ sr->usecount++;
+
/* We can take the fast path until this SR is eventually evicted. */
reln->smgr_shared = sr;
reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
@@ -269,6 +279,36 @@ smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
}
+/* LRU: sweep to find an SR to use. The SR is returned locked. */
+static SMgrSharedRelation *
+smgr_pool_sweep(void)
+{
+ SMgrSharedRelation *sr;
+ uint32 index;
+ uint32 flags;
+ int sr_used_count = 0;
+
+ for (;;)
+ {
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+ if (--(sr->usecount) <= 0)
+ {
+ elog(DEBUG5, "find block cache in sweep cache, use it");
+ return sr;
+ }
+ if (++sr_used_count >= smgr_shared_relations * smgr_pool_sweep_times)
+ {
+ elog(LOG, "all the block caches are used frequently, use a random one");
+ sr = &sr_pool->objects[random() % smgr_shared_relations];
+ return sr;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+}
+
/*
* Allocate a new invalid SMgrSharedRelation, and return it locked.
*
@@ -287,11 +327,8 @@ smgr_alloc_sr(void)
uint32 hash;
retry:
- /* Lock the next one in clock-hand order. */
- index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
- sr = &sr_pool->objects[index];
- flags = smgr_lock_sr(sr);
-
+ sr = smgr_pool_sweep();
+ flags = pg_atomic_read_u32(&sr->flags);
/* If it's unused, can return it, still locked, immediately. */
if (!(flags & SR_VALID))
return sr;
@@ -301,6 +338,7 @@ smgr_alloc_sr(void)
* locks, but we need to do it in that order, so we'll unlock the SR
* first.
*/
+ index = sr - sr_pool->objects;
rnode = sr->rnode;
smgr_unlock_sr(sr, flags);
@@ -480,6 +518,8 @@ smgrnblocks_update(SMgrRelation reln,
*/
sr->nblocks[forknum] = nblocks;
}
+ if (sr->usecount < smgr_pool_sweep_times)
+ sr->usecount++;
smgr_unlock_sr(sr, flags);
}
LWLockRelease(mapping_lock);
@@ -502,6 +542,7 @@ smgrnblocks_update(SMgrRelation reln,
{
/* Success! Initialize. */
mapping->index = sr - sr_pool->objects;
+ sr->usecount = 1;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
pg_atomic_write_u64(&sr->generation,
@@ -566,6 +607,7 @@ retry:
pg_write_barrier();
/* Mark it invalid and drop the mapping. */
+ sr->usecount = 0;
smgr_unlock_sr(sr, ~SR_VALID);
hash_search_with_hash_value(sr_mapping_table,
rnode,
@@ -618,6 +660,7 @@ smgr_shmem_init(void)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
+ sr_pool->objects[i].usecount = 0;
}
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 011a9aad1f..ce84c71040 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2608,6 +2608,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_pool_sweep_times", PGC_SIGHUP, UNGROUPED,
+ gettext_noop("Sets the times for smgr pool to clock sweep"),
+ NULL
+ },
+ &smgr_pool_sweep_times,
+ 32, 16, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 09f17bff92..18109f90b2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -20,6 +20,7 @@
/* GUCs. */
extern int smgr_shared_relations;
+extern int smgr_pool_sweep_times;
/* Definition private to smgr.c. */
struct SMgrSharedRelation;
--
2.30.1
No change yet, just posting a rebase to keep cfbot happy.
One thing I'm wondering about is whether it'd be possible, and if so,
a good idea, to make a kind of tiny reusable cache replacement
algorithm, something modern, that can be used to kill several birds
with one stone (SLRUs, this object pool, ...). Or if the interlocking
requirements make it too integrated.
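A minimal sketch of what such a reusable clock-sweep core might look like,
with hypothetical names (this is not code from the attached patches; the
v4 0003 patch above does something similar, but hard-wired into smgr.c):

#include "postgres.h"
#include "port/atomics.h"

/*
 * Hypothetical reusable replacement core: a clock hand plus a small
 * usecount per slot.  Callers would embed one of these next to their own
 * array of cached objects (smgr pool, SLRU, ...).
 */
typedef struct ClockSweep
{
	pg_atomic_uint32 hand;		/* next slot the clock hand will visit */
	uint32		nslots;			/* size of the caller's object array */
	uint8	   *usecount;		/* one small counter per slot */
} ClockSweep;

/* Record a reference on a cache hit; imprecision is fine, so no atomics. */
static inline void
clock_sweep_reference(ClockSweep *cs, uint32 slot)
{
	if (cs->usecount[slot] < 5)
		cs->usecount[slot]++;
}

/*
 * Advance the hand, decaying usecounts, until some slot reaches zero, and
 * return that slot as the eviction candidate.  The caller still has to
 * lock the slot and check that it is really evictable (not dirty, not
 * being synced) before reusing it, and go around again if not.
 */
static uint32
clock_sweep_choose(ClockSweep *cs)
{
	for (;;)
	{
		uint32		slot = pg_atomic_fetch_add_u32(&cs->hand, 1) % cs->nslots;

		if (cs->usecount[slot] == 0)
			return slot;
		cs->usecount[slot]--;
	}
}

The interlocking is the part that wouldn't factor out so cleanly: the smgr
pool still has to lock the chosen slot and look at the dirty/syncing flags
before evicting, and SLRUs have their own rules about what can be thrown
away.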
Attachments:
v5-0001-WIP-Track-relation-sizes-in-shared-memory.patch
From 2eeda704eb6c3b32f6c6cd9a6f3a434ffae15d31 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH v5 1/3] WIP: Track relation sizes in shared memory.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
smgr_shared_relations controls how many can exist at once, and they are
evicted as required. "Dirty" SMgrSharedRelations can only be evicted
after being synced to disk. Goals:
1. Provide faster lookups of relation sizes, cutting down on lseek()
calls. This supersedes the recovery-only caching added recently, and
replaces preexisting FSM and VM caching schemes.
2. Stop trusting the operating system to keep track of the size of
files that we have recently extended, until fsync() has been called.
XXX smgrimmedsync() is maybe too blunt an instrument?
XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...
XXX perhaps bgwriter should try to clean them?
XXX currently reusing the array of locks also used for buffer mapping,
need to define some more in lwlocks.c...
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
Reviewed-by: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
---
contrib/pg_visibility/pg_visibility.c | 1 -
src/backend/access/heap/visibilitymap.c | 28 +-
src/backend/catalog/storage.c | 2 -
src/backend/commands/dbcommands.c | 3 +
src/backend/storage/buffer/bufmgr.c | 44 +-
src/backend/storage/freespace/freespace.c | 35 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/smgr/md.c | 10 +
src/backend/storage/smgr/smgr.c | 559 ++++++++++++++++++++--
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/misc/guc.c | 11 +
src/include/storage/smgr.h | 19 +-
src/include/utils/wait_event.h | 1 +
13 files changed, 579 insertions(+), 140 deletions(-)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..ca00c3d384 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
check_relation_relkind(rel);
RelationOpenSmgr(rel);
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
block = visibilitymap_prepare_truncate(rel, 0);
if (BlockNumberIsValid(block))
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d8..59bd0e924a 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -547,6 +547,7 @@ static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
Buffer buf;
+ BlockNumber nblocks;
/*
* We might not have opened the relation at the smgr level yet, or we
@@ -557,20 +558,12 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
*/
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
- }
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+ nblocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
vm_extend(rel, blkno + 1);
@@ -636,17 +629,10 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+ /* Create the file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
/* Now extend the file */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..5796567cd9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -290,8 +290,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Make sure smgr_targblock etc aren't pointing somewhere past new end
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
/* Prepare for truncation of MAIN fork of the relation */
forks[nforks] = MAIN_FORKNUM;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..cf4019f92e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -970,6 +970,9 @@ dropdb(const char *dbname, bool missing_ok, bool force)
*/
DropDatabaseBuffers(db_id);
+ /* Drop SMGR relations. */
+ smgrdropdb(db_id);
+
/*
* Tell the stats collector to forget it immediately, too.
*/
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a22c4..aae44311f8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3095,30 +3095,11 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
* that we don't leave any buffer for the relation being dropped as
* otherwise the background writer or checkpointer can lead to a PANIC
* error while flushing buffers corresponding to files that don't exist.
- *
- * To know the exact size, we rely on the size cached for each fork by us
- * during recovery which limits the optimization to recovery and on
- * standbys but we can easily extend it once we have shared cache for
- * relation size.
- *
- * In recovery, we cache the value returned by the first lseek(SEEK_END)
- * and the future writes keeps the cached value up-to-date. See
- * smgrextend. It is possible that the value of the first lseek is smaller
- * than the actual number of existing blocks in the file due to buggy
- * Linux kernels that might not have accounted for the recent write. But
- * that should be fine because there must not be any buffers after that
- * file size.
*/
for (i = 0; i < nforks; i++)
{
/* Get the number of blocks for a relation's fork */
- nForkBlock[i] = smgrnblocks_cached(smgr_reln, forkNum[i]);
-
- if (nForkBlock[i] == InvalidBlockNumber)
- {
- nBlocksToInvalidate = InvalidBlockNumber;
- break;
- }
+ nForkBlock[i] = smgrnblocks(smgr_reln, forkNum[i]);
/* calculate the number of blocks to be invalidated */
nBlocksToInvalidate += (nForkBlock[i] - firstDelBlock[i]);
@@ -3128,8 +3109,7 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
* We apply the optimization iff the total number of blocks to invalidate
* is below the BUF_DROP_FULL_SCAN_THRESHOLD.
*/
- if (BlockNumberIsValid(nBlocksToInvalidate) &&
- nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+ if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
for (j = 0; j < nforks; j++)
FindAndDropRelFileNodeBuffers(rnode.node, forkNum[j],
@@ -3197,7 +3177,6 @@ DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
BlockNumber (*block)[MAX_FORKNUM + 1];
uint64 nBlocksToInvalidate = 0;
RelFileNode *nodes;
- bool cached = true;
bool use_bsearch;
if (nnodes == 0)
@@ -3238,22 +3217,19 @@ DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
* We can avoid scanning the entire buffer pool if we know the exact size
* of each of the given relation forks. See DropRelFileNodeBuffers.
*/
- for (i = 0; i < n && cached; i++)
+ for (i = 0; i < n; i++)
{
for (j = 0; j <= MAX_FORKNUM; j++)
{
- /* Get the number of blocks for a relation's fork. */
- block[i][j] = smgrnblocks_cached(rels[i], j);
-
- /* We need to only consider the relation forks that exists. */
- if (block[i][j] == InvalidBlockNumber)
+ if (!smgrexists(rels[i], j))
{
- if (!smgrexists(rels[i], j))
- continue;
- cached = false;
- break;
+ block[i][j] = InvalidBlockNumber;
+ continue;
}
+ /* Get the number of blocks for a relation's fork. */
+ block[i][j] = smgrnblocks(rels[i], j);
+
/* calculate the total number of blocks to be invalidated */
nBlocksToInvalidate += block[i][j];
}
@@ -3263,7 +3239,7 @@ DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
* We apply the optimization iff the total number of blocks to invalidate
* is below the BUF_DROP_FULL_SCAN_THRESHOLD.
*/
- if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+ if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
for (i = 0; i < n; i++)
{
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda238..05930ca846 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -531,29 +531,18 @@ static Buffer
fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
{
BlockNumber blkno = fsm_logical_to_physical(addr);
+ BlockNumber nblocks;
Buffer buf;
RelationOpenSmgr(rel);
- /*
- * If we haven't cached the size of the FSM yet, check it first. Also
- * recheck if the requested block seems to be past end, since our cached
- * value might be stale. (We send smgr inval messages on truncation, but
- * not on extension.)
- */
- if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
- {
- /* Invalidate the cache so smgrnblocks asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
- smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
- else
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
- }
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ nblocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ else
+ nblocks = 0;
/* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ if (blkno >= nblocks)
{
if (extend)
fsm_extend(rel, blkno + 1);
@@ -621,18 +610,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ /* Create the FSM file first if it doesn't exist. */
+ if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
while (fsm_nblocks_now < fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..0082e127a9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -121,6 +121,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size());
+ size = add_size(size, smgr_shmem_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
size = add_size(size, PredicateLockShmemSize());
@@ -213,6 +214,8 @@ CreateSharedMemoryAndSemaphores(void)
dsm_shmem_init();
+ smgr_shmem_init();
+
/*
* Set up xlog, clog, and buffers
*/
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..b78edd399f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1359,6 +1359,16 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}
+ /*
+ * XXX: We could have an interface smgrbeginclean() that would return true
+ * if it has managed to set SR_SYNCING and clear SR_JUST_DIRTIED, and then if
+ * so, after our sync runs we could call smgrfinishclean() with our
+ * success/failure report, which would clear SR_DIRTY if SR_JUST_DIRTIED
+ * hasn't been set in the meantime. But... how can we know if *this*
+ * segment is one that represents an extension? SR_DIRTY is just
+ * interested in syncing extended blocks.
+ */
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..80f942e1f6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,13 +19,68 @@
#include "access/xlog.h"
#include "lib/ilist.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/shmem.h"
#include "storage/ipc.h"
+#include "storage/lwlock.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+/*
+ * An entry in the hash table that allows us to look up objects in the
+ * SMgrSharedRelation pool by rnode (+ backend).
+ */
+typedef struct SMgrSharedRelationMapping
+{
+ RelFileNodeBackend rnode;
+ int index;
+} SMgrSharedRelationMapping;
+
+/*
+ * An object in shared memory tracks the size of the forks of a relation.
+ */
+struct SMgrSharedRelation
+{
+ RelFileNodeBackend rnode;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ pg_atomic_uint32 flags;
+};
+
+/* For now, we borrow the buffer managers array of locks. XXX fixme */
+#define SR_PARTITIONS NUM_BUFFER_PARTITIONS
+#define SR_PARTITION_LOCK(hash) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET].lock)
+
+/* Flags. */
+#define SR_LOCKED 0x01
+#define SR_VALID 0x02
+
+/* Each forknum gets its own dirty, syncing and just dirtied bits. */
+#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
+#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
+#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
+
+/* Masks to test if any forknum is currently dirty or syncing. */
+#define SR_SYNCING_MASK (((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1)))
+#define SR_DIRTY_MASK (((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1)))
+
+/* Extract the lowest dirty forknum from flags (there must be at least one). */
+#define SR_GET_ONE_DIRTY(mask) pg_rightmost_one_pos32((((mask) >> 2) & (SR_DIRTY_MASK >> 2)))
+
+typedef struct SMgrSharedRelationPool
+{
+ ConditionVariable sync_flags_cleared;
+ pg_atomic_uint32 next;
+ SMgrSharedRelation objects[FLEXIBLE_ARRAY_MEMBER];
+} SMgrSharedRelationPool;
+
+static SMgrSharedRelationPool *sr_pool;
+static HTAB *sr_mapping_table;
/*
* This struct of function pointers defines the API between smgr.c and
@@ -98,6 +153,422 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/* GUCs. */
+int smgr_shared_relations = 1000;
+
+/*
+ * Try to get the size of a relation's fork by looking it up in the mapping
+ * table with a shared lock. This will succeed if the SMgrRelation already
+ * exists.
+ */
+static BlockNumber
+smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ BlockNumber result = InvalidBlockNumber;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ result = sr->nblocks[forknum];
+ }
+ LWLockRelease(mapping_lock);
+
+ return result;
+}
+
+/*
+ * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
+ * only a few instructions. The return value is the current set of flags,
+ * which may be modified and then passed to smgr_unlock_sr() to be applied
+ * atomically when the lock is released.
+ */
+static uint32
+smgr_lock_sr(SMgrSharedRelation *sr)
+{
+ SpinDelayStatus spin_delay_status;
+
+ init_local_spin_delay(&spin_delay_status);
+
+ for (;;)
+ {
+ uint32 old_flags = pg_atomic_read_u32(&sr->flags);
+ uint32 flags;
+
+ if (!(old_flags & SR_LOCKED))
+ {
+ flags = old_flags | SR_LOCKED;
+ if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
+ return flags;
+ }
+
+ perform_spin_delay(&spin_delay_status);
+ }
+ return 0; /* unreachable */
+}
+
+/*
+ * Unlock a SMgrSharedRelation, atomically updating its flags at the same
+ * time.
+ */
+static void
+smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
+{
+ pg_write_barrier();
+ pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
+}
+
+/*
+ * Allocate a new invalid SMgrSharedRelation, and return it locked.
+ *
+ * The replacement algorithm is a simple FIFO design with no second chance for
+ * now.
+ */
+static SMgrSharedRelation *
+smgr_alloc_sr(void)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 index;
+ LWLock *mapping_lock;
+ uint32 flags;
+ RelFileNodeBackend rnode;
+ uint32 hash;
+
+ retry:
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+
+ /* If it's unused, can return it, still locked, immediately. */
+ if (!(flags & SR_VALID))
+ return sr;
+
+ /*
+ * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
+ * locks, but we need to do it in that order, so we'll unlock the SR
+ * first.
+ */
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ hash = get_hash_value(sr_mapping_table, &rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (!mapping || mapping->index != index)
+ {
+ /* Too slow, it's gone or now points somewhere else. Go around. */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* We will lock the SR for just a few instructions. */
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ /*
+ * If another backend is currently syncing any fork, we aren't allowed to
+ * evict it, and waiting for it would be pointless because that other
+ * backend already plans to allocate it. So go around.
+ */
+ if (flags & SR_SYNCING_MASK)
+ {
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /*
+ * We will sync every fork that is dirty, and then we'll try to
+ * evict it.
+ */
+ while (flags & SR_DIRTY_MASK)
+ {
+ SMgrRelation reln;
+ ForkNumber forknum = SR_GET_ONE_DIRTY(flags);
+
+ /* Set the sync bit, clear the just-dirtied bit and unlock. */
+ flags |= SR_SYNCING(forknum);
+ flags &= ~SR_JUST_DIRTIED(forknum);
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * Perform the I/O, with no locks held.
+ * XXX It sucks that we fsync every segment, not just the ones that need it...
+ */
+ reln = smgropen(rnode.node, rnode.backend);
+ smgrimmedsync(reln, forknum);
+
+ /*
+ * Reacquire the locks. The object can't have been evicted,
+ * because we set a sync bit.
+ */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
+ if (flags & SR_JUST_DIRTIED(forknum))
+ {
+ /*
+ * Someone else dirtied it while we were syncing, so we can't mark
+ * it clean. Let's give up on this SR and go around again.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+
+ /* This fork is clean! */
+ flags &= ~SR_DIRTY(forknum);
+ }
+
+ /*
+ * If we made it this far, there are no dirty forks, so we're now allowed
+ * to evict the SR from the pool and the mapping table.
+ */
+ flags &= ~SR_VALID;
+ smgr_unlock_sr(sr, flags);
+
+ /*
+ * If any callers of smgr_drop_sr() or smgrdropdb() had the misfortune
+ * to have to wait for us to finish syncing, we can now wake them up.
+ */
+ ConditionVariableBroadcast(&sr_pool->sync_flags_cleared);
+
+ /* Remove from the mapping table. */
+ hash_search_with_hash_value(sr_mapping_table,
+ &rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ LWLockRelease(mapping_lock);
+
+ /*
+ * XXX: We unlock the SR while doing HASH_REMOVE on principle. Maybe it'd
+ * be OK to hold it, now that the clock hand is far away and there is no
+ * way anyone can look up this SR through the mapping table.
+ */
+ flags = smgr_lock_sr(sr);
+ if (flags & SR_VALID)
+ {
+ /* Oops, someone else got it. */
+ smgr_unlock_sr(sr, flags);
+ goto retry;
+ }
+
+ return sr;
+}
+
+/*
+ * Set the number of blocks in a relation, in shared memory, and optionally
+ * also mark the relation as "dirty" (meaning it must be fsync'd before it
+ * can be evicted).
+ */
+static void
+smgrnblocks_update(SMgrRelation reln,
+ ForkNumber forknum,
+ BlockNumber nblocks,
+ bool mark_dirty)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr = NULL;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+ retry:
+ LWLockAcquire(mapping_lock, LW_SHARED);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+ flags = smgr_lock_sr(sr);
+ if (mark_dirty)
+ {
+ /*
+ * Extend and truncate clobber the value, and there are no races
+ * to worry about because they can have higher level exclusive
+ * locking on the relation.
+ */
+ sr->nblocks[forknum] = nblocks;
+
+ /*
+ * Mark it dirty, and if it's currently being sync'd, make sure it
+ * stays dirty after that completes.
+ */
+ flags |= SR_DIRTY(forknum);
+ if (flags & SR_SYNCING(forknum))
+ flags |= SR_JUST_DIRTIED(forknum);
+ }
+ else if (!(flags & SR_DIRTY(forknum)))
+ {
+ /*
+ * We won't clobber a dirty value with a non-dirty update, to
+ * avoid races against concurrent extend/truncate, but we can
+ * install a new clean value.
+ */
+ sr->nblocks[forknum] = nblocks;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+ LWLockRelease(mapping_lock);
+
+ /* If we didn't find it, then we'll need to allocate one. */
+ if (!sr)
+ {
+ bool found;
+
+ sr = smgr_alloc_sr();
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ &reln->smgr_rnode,
+ hash,
+ HASH_ENTER,
+ &found);
+ if (!found)
+ {
+ /* Success! Initialize. */
+ mapping->index = sr - sr_pool->objects;
+ smgr_unlock_sr(sr, SR_VALID);
+ sr->rnode = reln->smgr_rnode;
+ for (int i = 0; i <= MAX_FORKNUM; ++i)
+ sr->nblocks[i] = InvalidBlockNumber;
+ LWLockRelease(mapping_lock);
+ }
+ else
+ {
+ /* Someone beat us to it. Go around again. */
+ smgr_unlock_sr(sr, 0); /* = not valid */
+ LWLockRelease(mapping_lock);
+ goto retry;
+ }
+ }
+}
+
+static void
+smgr_drop_sr(RelFileNodeBackend *rnode)
+{
+ SMgrSharedRelationMapping *mapping;
+ SMgrSharedRelation *sr;
+ uint32 hash;
+ LWLock *mapping_lock;
+ uint32 flags;
+
+ hash = get_hash_value(sr_mapping_table, rnode);
+ mapping_lock = SR_PARTITION_LOCK(hash);
+
+retry:
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
+ mapping = hash_search_with_hash_value(sr_mapping_table,
+ rnode,
+ hash,
+ HASH_FIND,
+ NULL);
+ if (mapping)
+ {
+ sr = &sr_pool->objects[mapping->index];
+
+ flags = smgr_lock_sr(sr);
+ Assert(flags & SR_VALID);
+
+ if (flags & SR_SYNCING_MASK)
+ {
+ /*
+ * Oops, someone's syncing one of its forks; nothing to do but
+ * wait until that's finished.
+ */
+ smgr_unlock_sr(sr, flags);
+ LWLockRelease(mapping_lock);
+ ConditionVariableSleep(&sr_pool->sync_flags_cleared,
+ WAIT_EVENT_SMGR_DROP_SYNC);
+ goto retry;
+ }
+ ConditionVariableCancelSleep();
+
+ /* Mark it invalid and drop the mapping. */
+ smgr_unlock_sr(sr, ~SR_VALID);
+ hash_search_with_hash_value(sr_mapping_table,
+ rnode,
+ hash,
+ HASH_REMOVE,
+ NULL);
+ }
+ LWLockRelease(mapping_lock);
+}
+
+size_t
+smgr_shmem_size(void)
+{
+ size_t size = 0;
+
+ size = add_size(size,
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations);
+ size = add_size(size,
+ hash_estimate_size(smgr_shared_relations,
+ sizeof(SMgrSharedRelationMapping)));
+
+ return size;
+}
+
+void
+smgr_shmem_init(void)
+{
+ HASHCTL info;
+ bool found;
+
+ info.keysize = sizeof(RelFileNodeBackend);
+ info.entrysize = sizeof(SMgrSharedRelationMapping);
+ info.num_partitions = SR_PARTITIONS;
+ sr_mapping_table = ShmemInitHash("SMgrSharedRelation Mapping Table",
+ smgr_shared_relations,
+ smgr_shared_relations,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+
+ sr_pool = ShmemInitStruct("SMgrSharedRelation Pool",
+ offsetof(SMgrSharedRelationPool, objects) +
+ sizeof(SMgrSharedRelation) * smgr_shared_relations,
+ &found);
+ if (!found)
+ {
+ ConditionVariableInit(&sr_pool->sync_flags_cleared);
+ pg_atomic_init_u32(&sr_pool->next, 0);
+ for (uint32 i = 0; i < smgr_shared_relations; ++i)
+ {
+ pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ }
+ }
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -174,8 +645,6 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
- for (int i = 0; i <= MAX_FORKNUM; ++i)
- reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -246,6 +715,9 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
+ if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ return true;
+
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
}
@@ -429,6 +901,9 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
for (i = 0; i < nrels; i++)
CacheInvalidateSmgr(rnodes[i]);
+ for (i = 0; i < nrels; i++)
+ smgr_drop_sr(&rels[i]->smgr_rnode);
+
/*
* Delete the physical file(s).
*
@@ -464,16 +939,7 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
{
smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
buffer, skipFsync);
-
- /*
- * Normally we expect this to increase nblocks by one, but if the cached
- * value isn't as expected, just invalidate it so the next call asks the
- * kernel.
- */
- if (reln->smgr_cached_nblocks[forknum] == blocknum)
- reln->smgr_cached_nblocks[forknum] = blocknum + 1;
- else
- reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ smgrnblocks_update(reln, forknum, blocknum + 1, true);
}
/*
@@ -549,38 +1015,20 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
- /* Check and return if we get the cached value for the number of blocks. */
- result = smgrnblocks_cached(reln, forknum);
+ /* Can we get the answer from shared memory with only a share lock? */
+ result = smgrnblocks_shared(reln, forknum);
if (result != InvalidBlockNumber)
return result;
+ /* Ask the kernel. */
result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
- reln->smgr_cached_nblocks[forknum] = result;
+ /* Update the value in shared memory for faster service next time. */
+ smgrnblocks_update(reln, forknum, result, false);
return result;
}
-/*
- * smgrnblocks_cached() -- Get the cached number of blocks in the supplied
- * relation.
- *
- * Returns an InvalidBlockNumber when not in recovery and when the relation
- * fork size is not cached.
- */
-BlockNumber
-smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
-{
- /*
- * For now, we only use cached values in recovery due to lack of a shared
- * invalidation mechanism for changes in file size.
- */
- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
- return reln->smgr_cached_nblocks[forknum];
-
- return InvalidBlockNumber;
-}
-
/*
* smgrtruncate() -- Truncate the given forks of supplied relation to
* each specified numbers of blocks
@@ -617,19 +1065,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
/* Do the truncation */
for (i = 0; i < nforks; i++)
{
- /* Make the cached size is invalid if we encounter an error. */
- reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
-
smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);
-
- /*
- * We might as well update the local smgr_cached_nblocks values. The
- * smgr cache inval message that this function sent will cause other
- * backends to invalidate their copies of smgr_fsm_nblocks and
- * smgr_vm_nblocks, and these ones too at the next command boundary.
- * But these ensure they aren't outright wrong until then.
- */
- reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+ smgrnblocks_update(reln, forknum[i], nblocks[i], true);
}
}
@@ -662,6 +1099,34 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * When a database is dropped, we have to find and throw away all its
+ * SMgrSharedRelation objects.
+ */
+void
+smgrdropdb(Oid database)
+{
+ for (int i = 0; i < smgr_shared_relations; ++i)
+ {
+ SMgrSharedRelation *sr = &sr_pool->objects[i];
+ RelFileNodeBackend rnode;
+ uint32 flags;
+
+ /* Hold the spinlock only while we copy out the rnode of matches. */
+ flags = smgr_lock_sr(sr);
+ if ((flags & SR_VALID) && sr->rnode.node.dbNode == database)
+ {
+ rnode = sr->rnode;
+ smgr_unlock_sr(sr, flags);
+
+ /* Drop, if it's still valid. */
+ smgr_drop_sr(&rnode);
+ }
+ else
+ smgr_unlock_sr(sr, flags);
+ }
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 6baf67740c..eda9a3d0fa 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -439,6 +439,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SAFE_SNAPSHOT:
event_name = "SafeSnapshot";
break;
+ case WAIT_EVENT_SMGR_DROP_SYNC:
+ event_name = "SmgrDropSync";
+ break;
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68b62d523d..9daf04a5bc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -86,6 +86,7 @@
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
#include "storage/proc.h"
+#include "storage/smgr.h"
#include "storage/standby.h"
#include "tcop/tcopprot.h"
#include "tsearch/ts_cache.h"
@@ -2518,6 +2519,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_shared_relations", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared relation objects in memory at one time."),
+ NULL
+ },
+ &smgr_shared_relations,
+ 1000, 64, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..a6b4c84362 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,13 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/* GUCs. */
+extern int smgr_shared_relations;
+
+/* Definition private to smgr.c. */
+struct SMgrSharedRelation;
+typedef struct SMgrSharedRelation SMgrSharedRelation;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -44,14 +51,7 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
- /*
- * The following fields are reset to InvalidBlockNumber upon a cache flush
- * event, and hold the last known size for each fork. This information is
- * currently only reliable during recovery, since there is no cache
- * invalidation for fork extension.
- */
BlockNumber smgr_targblock; /* current insertion target block */
- BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */
/* additional public fields may someday exist here */
@@ -77,6 +77,9 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern size_t smgr_shmem_size(void);
+extern void smgr_shmem_init(void);
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -99,10 +102,10 @@ extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
-extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrdropdb(Oid database);
extern void AtEOXact_SMgr(void);
#endif /* SMGR_H */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6c6ec2e711..221061f08b 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -122,6 +122,7 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
+ WAIT_EVENT_SMGR_DROP_SYNC,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_WAL_RECEIVER_EXIT,
WAIT_EVENT_WAL_RECEIVER_WAIT_START,
--
2.30.2
Attachment: v5-0002-WIP-Provide-a-lock-free-fast-path-for-smgrnblocks.patch (text/x-patch, charset UTF-8)
From 2d99a95e5ba1c628fd98ab0c6dc9d4114342704b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 19 Nov 2020 17:09:51 +1300
Subject: [PATCH v5 2/3] WIP: Provide a lock-free fast path for smgrnblocks().
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
SMgrRelation objects gain a pointer to the last known SMgrSharedRelation
object. There are three concurrency hazards to worry about:
1. The SMgrSharedRelation pointed to could be evicted at any time. We
record a generation number, and insert memory barriers so that we can
detect that and fall back to a slower path.
2. The nblocks value is read without any locking, which is atomic
because it is a 32 bit value, and PostgreSQL requires atomic 32 bit
reads generally.
3. The nblocks value must be fresh enough for scans, extension,
truncation and dropping buffers, because those operations all executed a
memory barrier when they acquired a snapshot to scan (which doesn't need
to see blocks added after that) or an exclusive heavyweight lock to
extend, truncate or drop. XXX right?
XXX That's the idea anyway, but this is just a sketch; almost certainly
incomplet and inkorrect. (A standalone sketch of the read protocol
appears after this patch.)
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
Reviewed-by: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
---
src/backend/storage/smgr/smgr.c | 63 +++++++++++++++++++++++++++++++--
src/include/storage/smgr.h | 4 +++
2 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 80f942e1f6..ca877dc4af 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,7 @@ struct SMgrSharedRelation
RelFileNodeBackend rnode;
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
+ pg_atomic_uint64 generation; /* mapping change */
};
/* For now, we borrow the buffer managers array of locks. XXX fixme */
@@ -156,6 +157,40 @@ static void smgrshutdown(int code, Datum arg);
/* GUCs. */
int smgr_shared_relations = 1000;
+/*
+ * Try to get the size of a relation's fork without locking.
+ */
+static BlockNumber
+smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
+{
+ SMgrSharedRelation *sr = reln->smgr_shared;
+ BlockNumber result;
+
+ if (sr)
+ {
+ pg_read_barrier();
+
+ /* We can load int-sized values atomically without special measures. */
+ Assert(sizeof(sr->nblocks[forknum]) == sizeof(uint32));
+ result = sr->nblocks[forknum];
+
+ /*
+ * With a read barrier between the loads, we can check that the object
+ * still refers to the same rnode before trusting the answer.
+ */
+ pg_read_barrier();
+ if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ return result;
+
+ /*
+ * The generation doesn't match, the shared relation must have been
+ * evicted since we got a pointer to it. We'll need to do more work.
+ */
+ }
+
+ return InvalidBlockNumber;
+}
+
/*
* Try to get the size of a relation's fork by looking it up in the mapping
* table with a shared lock. This will succeed if the SMgrRelation already
@@ -183,6 +218,10 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
{
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+
+ /* We can take the fast path until this SR is eventually evicted. */
+ reln->smgr_shared = sr;
+ reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
}
LWLockRelease(mapping_lock);
@@ -344,9 +383,14 @@ smgr_alloc_sr(void)
/*
* If we made it this far, there are no dirty forks, so we're now allowed
- * to evict the SR from the pool and the mapping table.
+ * to evict the SR from the pool and the mapping table. Make sure that
+ * smgrnblocks_fast() sees that its pointer is now invalid by bumping the
+ * generation.
*/
flags &= ~SR_VALID;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
smgr_unlock_sr(sr, flags);
/*
@@ -460,6 +504,8 @@ smgrnblocks_update(SMgrRelation reln,
mapping->index = sr - sr_pool->objects;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
for (int i = 0; i <= MAX_FORKNUM; ++i)
sr->nblocks[i] = InvalidBlockNumber;
LWLockRelease(mapping_lock);
@@ -514,6 +560,11 @@ retry:
}
ConditionVariableCancelSleep();
+ /* Make sure smgrnblocks_fast() knows it's invalidated. */
+ pg_atomic_write_u64(&sr->generation,
+ pg_atomic_read_u64(&sr->generation) + 1);
+ pg_write_barrier();
+
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
hash_search_with_hash_value(sr_mapping_table,
@@ -566,6 +617,7 @@ smgr_shmem_init(void)
for (uint32 i = 0; i < smgr_shared_relations; ++i)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
+ pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
}
}
}
@@ -645,6 +697,8 @@ smgropen(RelFileNode rnode, BackendId backend)
/* hash_search already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
+ reln->smgr_shared = NULL;
+ reln->smgr_shared_generation = 0;
reln->smgr_which = 0; /* we only have md.c at present */
/* implementation-specific initialization */
@@ -715,7 +769,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
bool
smgrexists(SMgrRelation reln, ForkNumber forknum)
{
- if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
+ if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
return true;
return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
@@ -1015,6 +1069,11 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
BlockNumber result;
+ /* Can we get the answer from shared memory without locking? */
+ result = smgrnblocks_fast(reln, forknum);
+ if (result != InvalidBlockNumber)
+ return result;
+
/* Can we get the answer from shared memory with only a share lock? */
result = smgrnblocks_shared(reln, forknum);
if (result != InvalidBlockNumber)
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6b4c84362..09f17bff92 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -51,6 +51,10 @@ typedef struct SMgrRelationData
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
+ /* pointer to shared object, valid if non-NULL and generation matches */
+ SMgrSharedRelation *smgr_shared;
+ uint64 smgr_shared_generation;
+
BlockNumber smgr_targblock; /* current insertion target block */
/* additional public fields may someday exist here */
--
2.30.2
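To make the reasoning in the commit message above easier to follow, here is a
small, self-contained sketch of the generation-checked lock-free read that
smgrnblocks_fast() performs. It is not part of the patch: the names are
illustrative, and it uses C11 seq_cst atomics where the patch uses plain loads
plus pg_read_barrier()/pg_write_barrier() and a pg_atomic_uint64 generation.

#include <stdatomic.h>
#include <stdint.h>

#define INVALID_BLOCK_NUMBER UINT32_MAX

/* Shared pool slot; an evictor bumps the generation before reusing it. */
typedef struct SharedSlot
{
	_Atomic uint64_t generation;	/* bumped whenever the slot is recycled */
	_Atomic uint32_t nblocks;	/* cached size of one fork, in blocks */
} SharedSlot;

/* Per-backend reference: slot pointer plus the generation seen when cached. */
typedef struct LocalRef
{
	SharedSlot *slot;
	uint64_t	generation;
} LocalRef;

/* Fast path: return the cached size, or INVALID_BLOCK_NUMBER to force the slow path. */
static uint32_t
nblocks_fast(const LocalRef *ref)
{
	uint32_t	result;

	if (ref->slot == NULL)
		return INVALID_BLOCK_NUMBER;

	/* Load the value first... */
	result = atomic_load(&ref->slot->nblocks);

	/* ...then re-check the generation; if it still matches, trust the value. */
	if (atomic_load(&ref->slot->generation) == ref->generation)
		return result;

	return INVALID_BLOCK_NUMBER;	/* slot was recycled under us */
}

/* Eviction side: invalidate stale references before reusing the slot. */
static void
recycle_slot(SharedSlot *slot)
{
	atomic_fetch_add(&slot->generation, 1);
	atomic_store(&slot->nblocks, INVALID_BLOCK_NUMBER);
}

int
main(void)
{
	SharedSlot	slot = {0};
	LocalRef	ref;

	atomic_store(&slot.nblocks, 42);
	ref.slot = &slot;
	ref.generation = atomic_load(&slot.generation);

	uint32_t	hit = nblocks_fast(&ref);	/* 42: generation still matches */

	recycle_slot(&slot);
	uint32_t	miss = nblocks_fast(&ref);	/* INVALID_BLOCK_NUMBER: stale */

	return (hit == 42 && miss == INVALID_BLOCK_NUMBER) ? 0 : 1;
}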
Attachment: v5-0003-update-fifo-to-lru-to-sweep-a-valid-cache.patch (text/x-patch, charset US-ASCII)
From 70af7f6c97efb556d044e494416439a0adf13473 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 11 Mar 2021 21:09:40 +1300
Subject: [PATCH v5 3/3] update fifo to lru to sweep a valid cache
Author: <buzhen.cjx@alibaba-inc.com>
---
src/backend/storage/smgr/smgr.c | 55 +++++++++++++++++++++++++++++----
src/backend/utils/misc/guc.c | 10 ++++++
src/include/storage/smgr.h | 1 +
3 files changed, 60 insertions(+), 6 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ca877dc4af..d2cf20432c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -51,6 +51,7 @@ struct SMgrSharedRelation
BlockNumber nblocks[MAX_FORKNUM + 1];
pg_atomic_uint32 flags;
pg_atomic_uint64 generation; /* mapping change */
+ int64 usecount; /* used for clock sweep */
};
/* For now, we borrow the buffer managers array of locks. XXX fixme */
@@ -155,7 +156,8 @@ static dlist_head unowned_relns;
static void smgrshutdown(int code, Datum arg);
/* GUCs. */
-int smgr_shared_relations = 1000;
+int smgr_shared_relations = 10000;
+int smgr_pool_sweep_times = 32;
/*
* Try to get the size of a relation's fork without locking.
@@ -179,8 +181,13 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
* still refers to the same rnode before trusting the answer.
*/
pg_read_barrier();
+
if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation)
+ {
+ /* no need to use an atomic operation; usecount may be imprecise */
+ sr->usecount++;
return result;
+ }
/*
* The generation doesn't match, the shared relation must have been
@@ -219,6 +226,9 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum)
sr = &sr_pool->objects[mapping->index];
result = sr->nblocks[forknum];
+ /* no need to use an atomic operation; usecount may be imprecise */
+ sr->usecount++;
+
/* We can take the fast path until this SR is eventually evicted. */
reln->smgr_shared = sr;
reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation);
@@ -269,6 +279,36 @@ smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
}
+/* Clock sweep: find an SR to reuse. The returned SR is locked. */
+static SMgrSharedRelation *
+smgr_pool_sweep(void)
+{
+ SMgrSharedRelation *sr;
+ uint32 index;
+ uint32 flags;
+ int sr_used_count = 0;
+
+ for (;;)
+ {
+ /* Lock the next one in clock-hand order. */
+ index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
+ sr = &sr_pool->objects[index];
+ flags = smgr_lock_sr(sr);
+ if (--(sr->usecount) <= 0)
+ {
+ elog(DEBUG5, "found a reusable SR while sweeping the pool");
+ return sr;
+ }
+ if (++sr_used_count >= smgr_shared_relations * smgr_pool_sweep_times)
+ {
+ elog(LOG, "all SRs in the pool are in frequent use, picking a random one");
+ sr = &sr_pool->objects[random() % smgr_shared_relations];
+ return sr;
+ }
+ smgr_unlock_sr(sr, flags);
+ }
+}
+
/*
* Allocate a new invalid SMgrSharedRelation, and return it locked.
*
@@ -287,11 +327,8 @@ smgr_alloc_sr(void)
uint32 hash;
retry:
- /* Lock the next one in clock-hand order. */
- index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
- sr = &sr_pool->objects[index];
- flags = smgr_lock_sr(sr);
-
+ sr = smgr_pool_sweep();
+ flags = pg_atomic_read_u32(&sr->flags);
/* If it's unused, can return it, still locked, immediately. */
if (!(flags & SR_VALID))
return sr;
@@ -301,6 +338,7 @@ smgr_alloc_sr(void)
* locks, but we need to do it in that order, so we'll unlock the SR
* first.
*/
+ index = sr - sr_pool->objects;
rnode = sr->rnode;
smgr_unlock_sr(sr, flags);
@@ -480,6 +518,8 @@ smgrnblocks_update(SMgrRelation reln,
*/
sr->nblocks[forknum] = nblocks;
}
+ if (sr->usecount < smgr_pool_sweep_times)
+ sr->usecount++;
smgr_unlock_sr(sr, flags);
}
LWLockRelease(mapping_lock);
@@ -502,6 +542,7 @@ smgrnblocks_update(SMgrRelation reln,
{
/* Success! Initialize. */
mapping->index = sr - sr_pool->objects;
+ sr->usecount = 1;
smgr_unlock_sr(sr, SR_VALID);
sr->rnode = reln->smgr_rnode;
pg_atomic_write_u64(&sr->generation,
@@ -566,6 +607,7 @@ retry:
pg_write_barrier();
/* Mark it invalid and drop the mapping. */
+ sr->usecount = 0;
smgr_unlock_sr(sr, ~SR_VALID);
hash_search_with_hash_value(sr_mapping_table,
rnode,
@@ -618,6 +660,7 @@ smgr_shmem_init(void)
{
pg_atomic_init_u32(&sr_pool->objects[i].flags, 0);
pg_atomic_init_u64(&sr_pool->objects[i].generation, 0);
+ sr_pool->objects[i].usecount = 0;
}
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9daf04a5bc..037fc42cbc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2677,6 +2677,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"smgr_pool_sweep_times", PGC_SIGHUP, UNGROUPED,
+ gettext_noop("Sets the times for smgr pool to clock sweep"),
+ NULL
+ },
+ &smgr_pool_sweep_times,
+ 32, 16, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 09f17bff92..18109f90b2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -20,6 +20,7 @@
/* GUCs. */
extern int smgr_shared_relations;
+extern int smgr_pool_sweep_times;
/* Definition private to smgr.c. */
struct SMgrSharedRelation;
--
2.30.2
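For readers skimming the diff above, here is a minimal, self-contained sketch
of the usecount-based clock sweep that smgr_pool_sweep() implements; the names
are illustrative and the per-entry locking is omitted. The hand advances
around the pool decrementing usecounts, the first entry that decays to zero is
the eviction victim, and after a bounded number of passes (the patch's
smgr_pool_sweep_times, where the patch instead falls back to a random entry)
the sweep gives up.

#include <stdio.h>

#define POOL_SIZE	8
#define MAX_PASSES	4		/* plays the role of smgr_pool_sweep_times */

static int	usecount[POOL_SIZE];	/* bumped on every cache hit */
static unsigned int hand;		/* clock hand; wraps around the pool */

/* Return the index of the pool entry to evict and reuse. */
static int
sweep_for_victim(void)
{
	int		inspected = 0;

	for (;;)
	{
		int		i = hand++ % POOL_SIZE;

		if (--usecount[i] <= 0)
			return i;	/* cold entry: reuse this one */

		if (++inspected >= POOL_SIZE * MAX_PASSES)
			return i;	/* everything is hot: give up (the patch picks a random entry here) */
	}
}

int
main(void)
{
	/* Pretend entries 0..2 were hit recently and the rest were not. */
	usecount[0] = usecount[1] = usecount[2] = 3;
	printf("victim = %d\n", sweep_for_victim());	/* prints 3 */
	return 0;
}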
On Wed, Jun 16, 2021 at 9:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
No change yet, just posting a rebase to keep cfbot happy.
Hi, Thomas
I think that the proposed feature is pretty cool, not only because it fixes
some old issues with lseek() performance and reliability, but also because
it opens the door to some new features, such as disk size quota management [1].
I've only read the 1st patch so far. Here are some thoughts about it:
- SMgrSharedRelation - what about calling it SMgrShmemRelation? It looks a
bit odd too, but "...SharedRelation" makes me think of shared catalog tables.
- We can exclude temp relations from consideration, as they are never
shared, and use RelFileNode instead of RelFileNodeBackend in
SMgrSharedRelationMapping.
Not only will that save us some space, it will also help to avoid a
situation where temp relations occupy the whole cache (which can easily
happen in some workloads).
I assume you were trying to make a generic solution, but in practice temp
rels differ from regular ones and may require different optimizations.
Local caching would be enough for them if ever needed; for example, we
could keep smgr_cached_nblocks[] for temp relations.
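A minimal sketch of that idea (illustrative only; the field names are
assumptions based on the patch, not a tested change):

#include "storage/relfilenode.h"	/* provides RelFileNode: spcNode/dbNode/relNode, no backend ID */

/*
 * Hypothetical variant of the mapping entry keyed by RelFileNode rather than
 * RelFileNodeBackend; temp relations would simply never be entered here and
 * would keep using a backend-local cache such as smgr_cached_nblocks[].
 */
typedef struct SMgrSharedRelationMapping
{
	RelFileNode	rnode;		/* key: temp relations are never inserted */
	int		index;		/* slot in the SMgrSharedRelation pool */
} SMgrSharedRelationMapping;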
int smgr_shared_relations = 1000;
- How bad would it be to keep cache entries for all relations? Let's test
the memory overhead on some realistic datasets. I suppose the default of
1000 was chosen just for the WIP patch.
I wonder if we could use DSM instead of regular shared memory?
- I'd expect that CREATE DATABASE and ALTER DATABASE SET TABLESPACE require
special treatment, because they bypass smgr just like dropdb does, but I
don't see that handled in the patch.
[1]: https://github.com/greenplum-db/diskquota
--
Best regards,
Lubennikova Anastasia