[Patch] ALTER SYSTEM READ ONLY

Started by amul sul, over 5 years ago, 199 messages
#1 amul sul
sulamul@gmail.com
6 attachment(s)

Hi,

The attached patch proposes the $Subject feature, which forces the system
into read-only mode, where inserting write-ahead log (WAL) is prohibited
until ALTER SYSTEM READ WRITE is executed.

The high-level goal is to improve the availability/scale-out situation. The
feature will help in an HA setup where the master server needs to stop
accepting WAL writes immediately and kick out any transaction that expects to
write WAL at the end, in case the master's network goes down or its
replication connections fail.

For example, this feature allows for a controlled switchover without
needing to shut
down the master. You can instead make the master read-only, wait until the
standby
catches up, and then promote the standby. The master remains available for
read
queries throughout, and also for WAL streaming, but without the possibility
of any
new write transactions. After switchover is complete, the master can be
shut down
and brought back up as a standby without needing to use pg_rewind.
(Eventually, it
would be nice to be able to make the read-only master into a standby
without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect that
the master
has lost network connectivity to the standby, you might make it read-only
after 30 s,
and promote the standby after 60 s, so that you never have two writable
masters at the same time. In this case there is still some window for
split-brain, but it's better than what we have now.

Design:
----------
The proposed feature is built atop the global barrier mechanism (commit [1])
used to coordinate global state changes across all active backends. A backend
that executes the ALTER SYSTEM READ { ONLY | WRITE } command places a request
with the checkpointer process to change to the requested WAL read/write
state, a.k.a. the WAL-prohibited and WAL-permitted state respectively. When
the checkpointer process sees a WAL-prohibit state change request, it emits a
global barrier and waits until all backends that participate in ProcSignal
have absorbed it. Once that is done, the WAL read/write state in shared
memory and in the control file is updated so that XLogInsertAllowed() returns
accordingly.
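
To make that flow concrete, here is a rough sketch of the checkpointer side
built on the ProcSignal barrier primitives from commit [1]. This is for
illustration only, not the patch's actual code; the barrier type
PROCSIGNAL_BARRIER_WALPROHIBIT and the helper SetWALProhibitState() are
assumed names (see 0004 for the real implementation):

/* Rough illustration only -- names are assumptions, see 0004. */
#include "postgres.h"
#include "storage/procsignal.h"	/* EmitProcSignalBarrier(), WaitForProcSignalBarrier() */

static void
ProcessWALProhibitStateChangeRequest(bool wal_prohibited)
{
	uint64		barrier_gen;

	/*
	 * Signal every backend that participates in ProcSignal and wait until
	 * all of them have absorbed the barrier (sessions holding an XID are
	 * killed during absorption; see the next paragraph).
	 */
	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);	/* assumed type */
	WaitForProcSignalBarrier(barrier_gen);

	/*
	 * Only then update the WAL read/write state in shared memory and in the
	 * control file, so that XLogInsertAllowed() answers accordingly.
	 */
	SetWALProhibitState(wal_prohibited);	/* assumed helper */
}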

If there are open transactions that have acquired an XID, the sessions are
killed
before the barrier is absorbed. They can't commit without writing WAL, and
they
can't abort without writing WAL, either, so we must at least abort the
transaction. We
don't necessarily need to kill the session, but it's hard to avoid in all
cases because
(1) if there are subtransactions active, we need to force the top-level
abort record to
be written immediately, but we can't really do that while keeping the
subtransactions
on the transaction stack, and (2) if the session is idle, we also need the
top-level abort
record to be written immediately, but can't send an error to the client
until the next
command is issued without losing wire protocol synchronization. For now, we
just use
FATAL to kill the session; maybe this can be improved in the future.
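
For illustration, the per-backend barrier handler could look roughly like the
sketch below (0003/0004 contain the real ProcessBarrierWALProhibit(); the
error message and hint are taken from the demo output further down, while the
body and the bool return value are assumptions of mine):

static bool
ProcessBarrierWALProhibit(void)
{
	/*
	 * A session that has already acquired an XID can neither commit nor
	 * abort without writing WAL, so it cannot survive the transition;
	 * terminate it.
	 */
	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
		ereport(FATAL,
				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
				 errmsg("system is now read only"),
				 errhint("Cannot continue a transaction if it has performed writes while system is read only.")));

	/* No XID assigned: nothing to do, just acknowledge the new state. */
	return true;
}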

Open transactions that don't have an XID are not killed, but will get an
ERROR if they
try to acquire an XID later, or if they try to write WAL without acquiring
an XID (e.g. VACUUM).
To make that happen, the patch adds a new coding rule: a critical section
that will write
WAL must be preceded by a call to CheckWALPermitted(),
AssertWALPermitted(), or
AssertWALPermitted_HaveXID(). The latter variants are used when we know for
certain
that inserting WAL here must be OK, either because we have an XID (we would
have
been killed by a change to read-only if one had occurred) or for some other
reason.
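
Concretely, the call sites touched by 0005 end up looking like this (excerpts
from the heap_inplace_update() and heap_insert() hunks in the attached patch):

	/* Can reach here from VACUUM, so need not have an XID */
	if (RelationNeedsWAL(relation))
		CheckWALPermitted();

	/* NO EREPORT(ERROR) from here till changes are logged */
	START_CRIT_SECTION();

and, where an XID is guaranteed:

	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
	AssertWALPermitted_HaveXID();

	/* NO EREPORT(ERROR) from here till changes are logged */
	START_CRIT_SECTION();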

The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
SYSTEM READ WRITE update not only the shared memory state but also the
control
file, so that changes survive a restart.

The transition between read-write and read-only is a pretty major one, so we
emit a log message for each successful execution of an ALTER SYSTEM READ
{ ONLY | WRITE } command. Also, we have added a new GUC, system_is_read_only,
which returns "on" when the system is in the WAL-prohibited state or in
recovery.

Another part of the patch that is somewhat awkward and needs discussion is
shutdown in the read-only state: we skip the shutdown checkpoint, and at
restart, startup recovery is performed first and then the read-only state is
restored to prohibit further WAL writes, irrespective of whether the recovery
checkpoint succeeded or not. The concern here is that if that startup
recovery checkpoint did not succeed, it will never happen, even if the system
is later put back into read-write mode. Thoughts?

Quick demo:
----------------
We have a few active sessions. Session 1 has performed some writes and then
stayed idle for some time; meanwhile, in session 2, a superuser successfully
changed the system state to read-only via the ALTER SYSTEM READ ONLY command,
which kills session 1. Any other backend that tries to run a write
transaction thereafter will see a read-only-system error.

------------- SESSION 1 -------------
session_1=# BEGIN;
BEGIN
session_1=*# CREATE TABLE foo AS SELECT i FROM generate_series(1,5) i;
SELECT 5

------------- SESSION 2 -------------
session_2=# ALTER SYSTEM READ ONLY;
ALTER SYSTEM

------------- SESSION 1 -------------
session_1=*# COMMIT;
FATAL: system is now read only
HINT: Cannot continue a transaction if it has performed writes while
system is read only.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

------------- SESSION 3 -------------
session_3=# CREATE TABLE foo_bar (i int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

------------- SESSION 4 -------------
session_4=# CHECKPOINT;
ERROR: system is now read only

The system can be put back into read-write mode with "ALTER SYSTEM READ WRITE":

------------- SESSION 2 -------------
session_2=# ALTER SYSTEM READ WRITE;
ALTER SYSTEM

------------- SESSION 3 -------------
session_3=# CREATE TABLE foo_bar (i int);
CREATE TABLE

------------- SESSION 4 -------------
session_4=# CHECKPOINT;
CHECKPOINT

TODOs:
-----------
1. Documentation.

Attachments summary:
------------------------------
I tried to split the changes so that they are easy to read and show the
incremental implementation.

0001: Patch by Robert, adding the ability to report an error during global
      barrier absorption.
0002: Patch implementing the ALTER SYSTEM READ { ONLY | WRITE } syntax and
      psql tab-completion support for it.
0003: A basic implementation where the system accepts the $Subject command
      and changes the system to read-only by emitting a barrier.
0004: Patch enhancing that so the backend only executes the $Subject command
      and places a request with the checkpointer, which is responsible for
      changing the state by emitting the barrier. Also stores the state in
      the control file so that it persists across server restarts.
0005: Patch tightening the checks to prevent errors inside a critical
      section.
0006: Documentation - WIP

Credit:
-------
The feature is one part of Andres Freund's high-level design ideas for
built-in graceful failover for PostgreSQL. The feature's implementation
design is by Robert Haas. The initial patch is by Amit Khandekar; further
work and improvements, including this mail write-up, are by me under
Robert's guidance.

Ref:
----
[1] Global barrier commit 16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b

Thank you !

Regards,
Amul Sul
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v1-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/octet-stream)
From c7071ed91df1b48d97d06398024b512ef178e284 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 16 Jun 2020 06:35:41 -0400
Subject: [PATCH v1 5/6] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, add an Assert or an ERROR when the system is
WAL-prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, e.g.
   in case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, added a common
   static inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also validates the XID.  For that, added
   AssertWALPermitted_HaveXID().

To enforce the rule of having the aforesaid assert or error check before
entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added.  If the check is missing, XLogBeginInsert()
will fail an assertion when it is called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() to do that check and report
an error.
---
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 16 ++++++++
 src/backend/access/brin/brin_revmap.c     |  8 ++++
 src/backend/access/gin/ginbtree.c         | 17 ++++++--
 src/backend/access/gin/gindatapage.c      | 14 ++++++-
 src/backend/access/gin/ginfast.c          |  8 ++++
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          | 10 ++++-
 src/backend/access/gin/ginvacuum.c        |  9 +++++
 src/backend/access/gist/gist.c            | 16 ++++++++
 src/backend/access/gist/gistvacuum.c      |  9 +++++
 src/backend/access/hash/hash.c            | 13 ++++++
 src/backend/access/hash/hashinsert.c      |  8 ++++
 src/backend/access/hash/hashovfl.c        | 14 +++++++
 src/backend/access/hash/hashpage.c        | 13 ++++++
 src/backend/access/heap/heapam.c          | 32 +++++++++++++++
 src/backend/access/heap/pruneheap.c       |  7 +++-
 src/backend/access/heap/vacuumlazy.c      | 13 ++++++
 src/backend/access/heap/visibilitymap.c   | 20 ++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  4 ++
 src/backend/access/nbtree/nbtinsert.c     | 13 +++++-
 src/backend/access/nbtree/nbtpage.c       | 24 +++++++++++
 src/backend/access/spgist/spgdoinsert.c   | 19 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 13 ++++++
 src/backend/access/transam/multixact.c    |  6 ++-
 src/backend/access/transam/twophase.c     | 10 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  7 ++++
 src/backend/access/transam/xlog.c         | 27 +++++++++----
 src/backend/access/transam/xloginsert.c   | 13 +++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 +++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 11 ++++-
 src/backend/storage/lmgr/lock.c           |  6 +--
 src/backend/utils/cache/relmapper.c       |  4 ++
 src/include/access/walprohibit.h          | 49 ++++++++++++++++++++++-
 src/include/miscadmin.h                   | 27 +++++++++++++
 40 files changed, 490 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5ee0c..ef002a51773 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -758,6 +759,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..197e1213137 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
@@ -240,6 +245,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(idxrel))
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -881,6 +894,9 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 9c4b3e22021..80b6e826ae7 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -405,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 				(errmsg("leftover placeholder tuple detected in BRIN index \"%s\", deleting",
 						RelationGetRelationName(idxrel))));
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(idxrel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -614,6 +619,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8d08b05f515..1b835b3000b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -333,6 +334,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -378,6 +380,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -386,10 +389,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -410,7 +417,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -548,6 +555,10 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -588,7 +599,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..226cb3ce44b 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -836,7 +837,11 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 		}
 
 		if (RelationNeedsWAL(indexrel))
+		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -1777,6 +1782,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1831,18 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..d7781de7674 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,9 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -587,7 +591,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * critical section.
 		 */
 		if (RelationNeedsWAL(index))
+		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..d957aa6e582 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1fedbc..938089238da 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,19 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..36a884af597 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -159,6 +160,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(gvs->index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -650,6 +655,10 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 79fe6eb8d62..8f6b15d8ee4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -134,6 +135,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -467,6 +471,10 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (!is_build && RelationNeedsWAL(rel))
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -525,6 +533,10 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -1665,6 +1677,10 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 
 	if (ndeletable > 0)
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..ccf9bc0c214 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -341,6 +342,10 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(rel))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -634,6 +639,10 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(info->index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3ec6d528e77..1d3f4c92f19 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -572,6 +573,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -787,6 +792,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(rel))
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -882,6 +891,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..360e30456fe 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,9 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -370,6 +374,10 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..5abba14899e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,9 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -577,6 +581,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	if (RelationNeedsWAL(rel))
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -929,7 +937,13 @@ readpage:
 					 * WAL for that.
 					 */
 					if (RelationNeedsWAL(rel))
+					{
+						/*
+						 * Can reach here from VACUUM, so need not have an XID
+						 */
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..faad58297d2 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,9 @@ restart_expand:
 		goto fail;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1176,9 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+					AssertWALPermitted_HaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1230,9 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+			AssertWALPermitted_HaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1279,9 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d1bb3..c52200463a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -46,6 +46,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1870,6 +1871,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2143,6 +2147,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2661,6 +2668,9 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3413,6 +3423,9 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3586,6 +3599,9 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4519,6 +4535,9 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5310,6 +5329,9 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5468,6 +5490,9 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5576,6 +5601,9 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5722,6 +5750,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(relation))
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd8d9a..e7fcfb02864 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -81,7 +82,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * clean the page. The master will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -225,6 +226,10 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 									 &prstate);
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(relation))
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3bef0e124ba..3613b7a88d6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -1195,6 +1196,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				/* Can reach here from VACUUM, so need not have an XID */
+				if (RelationNeedsWAL(onerel))
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1463,6 +1468,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(onerel))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1914,6 +1923,10 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, NULL);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(onerel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40d..30d1d6f34c7 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -270,6 +271,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never runs in the WAL-prohibit state,
+	 * so skip the permission check if we reach here in the startup process.
+	 */
+	if (RelationNeedsWAL(rel))
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -476,6 +487,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +501,14 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index b20faf693da..2b29e8f4fef 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -273,6 +274,9 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 55fe16bd4e1..b88ec09a397 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1245,6 +1246,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1903,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2468,6 +2476,9 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 75628e0eb98..09b45fbb559 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -201,6 +202,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	LockBuffer(metabuf, BT_WRITE);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -376,6 +381,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -1068,6 +1077,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1195,6 +1208,9 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1812,6 +1828,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2168,6 +2188,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..003b5e80f21 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,9 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +462,9 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1116,9 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1527,9 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1616,9 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1804,9 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..39bace9e490 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -323,6 +324,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -447,6 +452,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -505,6 +514,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ce84dac0c40..2b7b2ccad31 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,9 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2942,7 +2946,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e1904877faa..1d8782237ff 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1112,6 +1113,9 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	/* Recording transaction prepares, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2204,6 +2208,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2294,6 +2301,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e3..365de44321d 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -73,6 +74,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* Cannot assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextFullXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index a8cda2fafbc..896f0917cef 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -16,6 +16,16 @@
 #include "postmaster/bgwriter.h"
 #include "storage/procsignal.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission has been
+ * checked before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted, AssertWALPermitted_HaveXID, or AssertWALPermitted must be
+ * called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * ProcessBarrierWALProhibit()
  *
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cf15eca53ef..facd0a51f2e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1290,6 +1291,9 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		/* We'll be reaching here with valid XID only. */
+		AssertWALPermitted_HaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1650,6 +1654,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermitted_HaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ded36113d1a..90cd534baab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1024,7 +1024,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2859,9 +2859,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8830,6 +8832,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8859,6 +8863,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9087,6 +9093,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9244,6 +9252,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9877,7 +9887,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9891,10 +9901,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9916,8 +9926,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09eb..d69f6ca427a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -124,9 +125,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, the WAL-prohibited ERROR would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -204,6 +210,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..f961178b358 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1e80d53c18a..18e1007e145 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -934,6 +934,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 29c920800a6..ba74ddcd249 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3603,13 +3603,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 95a21f6cc38..5faa69fabb9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,20 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +314,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 7fecb381625..df0b14dafba 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..ec48073bbf1 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,9 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 163fe0d2fce..1adcfc571d6 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -19,8 +19,8 @@ extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
@@ -29,4 +29,49 @@ extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
  */
 #define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL-prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermitted_HaveXID(void)
+{
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * Unlike the assertions above, a transaction that does not have a valid XID
+ * is not killed while the system is being changed to the WAL-prohibited state.
+ * Therefore, we need to explicitly error out before entering the critical
+ * section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 18bc8a7b904..63459305383 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer inside a critical section;
+ * otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

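Taken together, the hunks above establish a calling convention: the WAL-permission
check happens before START_CRIT_SECTION(), so a prohibited write surfaces as an
ERROR rather than a PANIC inside the critical section.  The sketch below condenses
that convention for a DML-style change that carries an XID; the function name and
the use of log_newpage_buffer() as the WAL record are illustrative only, not part
of the patch.

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xact.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Illustrative only: a WAL-logged buffer change following the new rule. */
static void
wal_logged_change_sketch(Relation rel, Buffer buf)
{
	if (RelationNeedsWAL(rel))
	{
		GetTopTransactionId();			/* DML path: an XID gets assigned */
		AssertWALPermitted_HaveXID();	/* assert-only; no cost in production */
	}

	START_CRIT_SECTION();

	MarkBufferDirty(buf);
	if (RelationNeedsWAL(rel))
		log_newpage_buffer(buf, true);	/* stands in for the real WAL record */

	END_CRIT_SECTION();
}
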
v1-0006-Documentation-WIP.patch
From 33b8057d8789418c02b6f3afd954a17c5c340986 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 2 Jun 2020 00:45:20 -0400
Subject: [PATCH v1 6/6] Documentation - WIP

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 59 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 13 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..f62929f1660 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -433,8 +433,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -477,6 +477,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to
+insert write-ahead log records, either because the system is still in recovery
+or because it has been forced into the WAL-prohibited state by ALTER SYSTEM
+READ ONLY.  We have a lower-level defense in XLogBeginInsert() and elsewhere
+that stops us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is reached inside a critical section we must not depend on it
+to report an error, since that error would escalate to a PANIC as noted above.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time to stop WAL writing.  Any backend
+that receives the read-only state transition barrier must stop writing WAL
+immediately.  While absorbing the barrier, a backend whose transaction has a
+valid XID is killed, since a valid XID indicates that the transaction has
+performed, or is planning, a WAL write.  Transactions that have not yet
+acquired an XID, and operations such as VACUUM or CREATE INDEX CONCURRENTLY
+that do not necessarily have a valid XID when they write WAL, are not
+interrupted during barrier processing; they may instead hit the error from
+XLogBeginInsert() when they try to write WAL in the read-only state.  To keep
+that error from being raised inside a critical section, WAL write permission
+has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assertion flag that records whether
+permission was checked before XLogBeginInsert() is called.  If it was not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the permission-checked flag, one of
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is reset automatically
+when the critical section is exited.  The rules for choosing among the
+permission check routines are:
+
+	Places where a WAL write can occur inside a critical section without a
+	valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error can be reported before the critical section is entered.
+
+	Places that perform INSERT or UPDATE, which never happen without a valid
+	XID, can be checked with AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay the cost of the check.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but still need the permission-checked flag set in
+	assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -522,7 +570,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -630,8 +679,8 @@ If the buffer is clean and checksums are in use then
 MarkBufferDirtyHint() inserts an XLOG_FPI record to ensure that we
 take a full page image that includes the hint. We do this to avoid
 a partial page write, when we write the dirtied page. WAL is not
-written during recovery, so we simply skip dirtying blocks because
-of hints when in recovery.
+written while the system is read-only (i.e. during recovery or in the WAL
+prohibited state), so we simply skip dirtying blocks because of hints then.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index 4e45bd92abc..e5a32e53649 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,10 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the master.
+New WAL records cannot be written during recovery or in the WAL-prohibited
+state, so when checksums are enabled, hint bits set while the system is read
+only must not dirty the page if the buffer is not already dirty.  Systems in
+Hot-Standby mode may benefit from hint bits being set, but with checksums
+enabled, a page cannot be dirtied after setting a hint bit (due to the torn page
+risk). So, it must wait for full-page images containing the hint bit updates to
+arrive from the master.
-- 
2.18.0

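As a companion to the first rule in the README text above (WAL writes that can
happen without an XID, e.g. from VACUUM), here is a condensed sketch of the
CheckWALPermitted() variant, modeled on the FreeSpaceMapPrepareTruncateRel()
hunk in the earlier patch; the function name is illustrative, and the includes
are the same as in the previous sketch.

/*
 * Illustrative only: a WAL write that may run without an XID (VACUUM-style).
 * CheckWALPermitted() raises a plain ERROR before the critical section, so a
 * read-only system never turns this write attempt into a PANIC.
 */
static void
no_xid_wal_write_sketch(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	if (needwal)
		CheckWALPermitted();	/* ereport(ERROR, ...) if WAL is prohibited */

	START_CRIT_SECTION();

	MarkBufferDirty(buf);
	if (needwal)
		log_newpage_buffer(buf, false);

	END_CRIT_SECTION();
}
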
v1-0002-Add-alter-system-read-only-write-syntax.patch
From 1ed99836726515a31548084894470deb5636ed5c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v1 2/6] Add alter system read only/write syntax

Note that the syntax doesn't have any implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 8 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d08..19aa6a2f88b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(WALProhibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5406,6 +5415,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 627b026b195..01cedb38115 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(WALProhibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3458,6 +3464,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e669d75a5af..f97bd6f658e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -480,6 +480,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10173,8 +10174,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->WALProhibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 97cbaa3072b..900088a2209 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2772,6 +2779,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3636,3 +3644,15 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index eb018854a5c..d586cf74816 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1858,9 +1858,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 5e1ffafb91b..636654bb450 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3194,6 +3194,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c65a55257dd..eb48b29828e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.18.0

v1-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch
From deb5ca23539bd1bd8c19ec1620909ff118b350f7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 16 Jun 2020 06:32:01 -0400
Subject: [PATCH v1 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() emits the
    PROCSIGNAL_BARRIER_WALPROHIBIT barrier and waits until the barrier has been
    absorbed by all backends.

 2. When a backend receives the WAL-prohibit barrier, if it is already in a
    transaction and that transaction has been assigned an XID, the backend is
    killed by throwing FATAL.  (XXX: needs more discussion.)

 3. Otherwise, if the backend is running a transaction that has not yet been
    assigned an XID, nothing special is needed; it simply calls
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or a new backend) starts as a read-only
    transaction.

 5. Auxiliary processes such as the autovacuum launcher, background writer,
    checkpointer, and walwriter do nothing while the server is in the
    WAL-prohibited state until someone wakes them up, e.g. a backend that
    later requests putting the system back into read-write mode.

 6. At shutdown in WAL-prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery.
    (XXX: needs some discussion as well.)

 7. ALTER SYSTEM READ ONLY/WRITE is not allowed on a standby server.

 8. Only a superuser can toggle the WAL-prohibit state.

 9. Add a system_is_read_only GUC to show the system state -- it is true when
    the system is WAL-prohibited or in recovery.
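
In code terms, the request/absorb handshake described in points 1-3 boils down
to the two illustrative wrappers below; every call inside them appears verbatim
in the diff that follows, only the wrapper names are invented for this summary.

#include "postgres.h"

#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "storage/procsignal.h"

/* Requesting backend: the gist of AlterSystemSetWALProhibitState(). */
static void
request_wal_prohibit_sketch(void)
{
	MakeReadOnlyXLOG();			/* flip the shared-memory flag */
	WaitForProcSignalBarrier(
		EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
}

/* Every other backend: the gist of ProcessBarrierWALProhibit(). */
static void
absorb_wal_prohibit_sketch(void)
{
	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
		ereport(FATAL,			/* has an XID: kill the session */
				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
				 errmsg("system is now read only")));

	ResetLocalXLogInsertAllowed();	/* no XID: re-check on the next insert */
}
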
---
 src/backend/access/transam/Makefile      |  1 +
 src/backend/access/transam/walprohibit.c | 81 ++++++++++++++++++++++++
 src/backend/access/transam/xact.c        | 49 ++++++++------
 src/backend/access/transam/xlog.c        | 72 ++++++++++++++++++---
 src/backend/postmaster/autovacuum.c      |  4 ++
 src/backend/postmaster/bgwriter.c        |  2 +-
 src/backend/postmaster/checkpointer.c    | 12 ++++
 src/backend/storage/ipc/procsignal.c     | 26 ++------
 src/backend/tcop/utility.c               | 14 +---
 src/backend/utils/misc/guc.c             | 26 ++++++++
 src/include/access/walprohibit.h         | 21 ++++++
 src/include/access/xlog.h                |  3 +
 src/include/storage/procsignal.h         |  7 +-
 13 files changed, 246 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..df97596ddf9
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "postmaster/bgwriter.h"
+#include "storage/procsignal.h"
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go WAL prohibit state.
+	 * to go into the WAL-prohibited state.
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * the transaction by throwing ERROR, for the following reasons that
+		 * still need thought:
+		 *
+		 * 1. Due to the challenges the wire protocol presents, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, the ERROR will kill only the
+		 * current subtransaction. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Cannot continue a transaction if it has performed writes while system is read only.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Yet to add ALTER SYSTEM READ WRITE support */
+	if (!stmt->WALProhibited)
+		elog(ERROR, "XXX: Yet to implement");
+
+	MakeReadOnlyXLOG();
+	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d365..cf15eca53ef 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1935,23 +1935,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
@@ -4873,9 +4878,11 @@ CommitSubTransaction(void)
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
 	 * upper is read-write while the child is read-only; GUC will incorrectly
-	 * think it should leave the child state in place.
+	 * think it should leave the child state in place.  Note that the upper
+	 * transaction is forced to read-only, irrespective of its previous
+	 * status, if the server is in the WAL-prohibited state.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
@@ -5031,9 +5038,11 @@ AbortSubTransaction(void)
 	/*
 	 * Restore the upper transaction's read-only state, too.  This should be
 	 * redundant with GUC's cleanup but we may as well do it for consistency
-	 * with the commit case.
+	 * with the commit case.  Note that the upper transaction is forced to
+	 * read-only, irrespective of its previous status, if the server is in the
+	 * WAL-prohibited state.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	RESUME_INTERRUPTS();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc7..83919bdb1f0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -245,9 +245,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL-prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -659,6 +660,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * Protected by info_lck.
+	 */
+	bool		WALProhibited;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -7959,6 +7966,25 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+void
+MakeReadOnlyXLOG(void)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->WALProhibited = true;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->WALProhibited;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8174,9 +8200,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8190,14 +8216,25 @@ XLogInsertAllowed(void)
 		return (bool) LocalXLogInsertAllowed;
 
 	/*
-	 * Else, must check to see if we're still in recovery.
+	 * Else, must check to see if we're still in recovery
 	 */
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8213,12 +8250,20 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
+	Assert(!IsWALProhibited());
+
 	LocalXLogInsertAllowed = 1;
 
 	/* Initialize as RecoveryInProgress() would do when switching state */
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8509,7 +8554,10 @@ ShutdownXLOG(int code, Datum arg)
 
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	/*
+	 * Can't perform checkpoint or xlog rotation without writing WAL.
+	 */
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8522,6 +8570,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..f83f86994db 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427f..6c6ff7dc3af 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -268,7 +268,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b80..5e5e56d4eec 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -342,6 +342,18 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		/*
+		 * If the server is in WAL-Prohibited state then don't do anything until
+		 * someone wakes us up. E.g. a backend might later on request us to put
+		 * the system back to read-write.
+		 */
+		if (IsWALProhibited())
+		{
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 27eebdafda1..0bc9f778909 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -471,9 +471,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -515,24 +515,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, generation);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 900088a2209..2767cf18c68 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3644,15 +3644,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	/* some code */
-	elog(INFO, "AlterSystemSetWALProhibitState() called");
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 75fc6f11d6a..58b56eac21f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -221,6 +221,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -610,6 +611,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2041,6 +2043,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -11998,4 +12012,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..619c33cd780
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,21 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+
+#endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d8..ca7ae766e3f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -298,11 +298,13 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool IsWALProhibited(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -322,6 +324,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void MakeReadOnlyXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index a0c0bc3ce55..c425f1ccf48 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -47,12 +47,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
-- 
2.18.0

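The checkpointer hunk above just sleeps on its latch while the system is
WAL-prohibited.  A hypothetical WAL-writing background process (none is added by
this patch) could follow the same pattern; the loop below is only a sketch, with
PG_WAIT_EXTENSION standing in for a proper wait event.

#include "postgres.h"

#include "access/xlog.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

/* Illustrative main loop: do nothing while WAL writes are prohibited. */
static void
hypothetical_worker_loop(void)
{
	for (;;)
	{
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();

		if (IsWALProhibited())
		{
			/* Sleep until someone (e.g. ALTER SYSTEM READ WRITE) wakes us. */
			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
							 -1L, PG_WAIT_EXTENSION);
			continue;
		}

		/* ... WAL-generating work would go here ... */

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L, PG_WAIT_EXTENSION);
	}
}
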
v1-0001-Allow-error-or-refusal-while-absorbing-barriers.patch
From 7d19958a77d0d2d278e9b3479e80d3a29fe62433 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 May 2020 06:17:36 -0400
Subject: [PATCH v1 1/6] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c809196d06a..27eebdafda1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -447,17 +451,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -469,7 +515,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, generation);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -479,7 +525,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.18.0

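The comments added to ProcessBarrierPlaceholder() spell out the contract for
future barrier types.  For reference, a hypothetical ProcessBarrierSomethingElse()
honouring that contract might look like the sketch below; the two helper
functions are placeholders invented for the example.

#include "postgres.h"

/* Placeholder helpers; a real barrier type would have its own logic here. */
static bool
something_else_can_absorb_now(void)
{
	return true;
}

static void
something_else_absorb(void)
{
	/* apply the global state change locally */
}

/*
 * Contract from the comments above: return true once the barrier is fully
 * absorbed, return false to be retried at the next CHECK_FOR_INTERRUPTS(),
 * and errors are safe -- the caller re-queues the flags and re-throws.
 */
static bool
ProcessBarrierSomethingElse(void)
{
	if (!something_else_can_absorb_now())
		return false;			/* rare: retries are expensive */

	something_else_absorb();
	return true;
}
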
v1-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patch
From 73a20b949c3d1365163c702b1cf947fb54680efd Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 May 2020 06:39:43 -0400
Subject: [PATCH v1 4/6] Use checkpointer to make system READ-ONLY or
 READ-WRITE

Up to the previous commit, the backend did this itself; now the backend
requests the checkpointer to do it.  The checkpointer, noticing that the
current state has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, performs
the barrier request and then acknowledges back to the backend that requested
the state change.

Note that this commit also adds ALTER SYSTEM READ WRITE support and makes the
WAL-prohibited state persistent across server restarts.
---
 src/backend/access/transam/walprohibit.c |  26 ++++--
 src/backend/access/transam/xlog.c        |  71 ++++++++++++++--
 src/backend/postmaster/checkpointer.c    | 100 +++++++++++++++++++++--
 src/backend/postmaster/pgstat.c          |   3 +
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  11 +++
 src/include/access/xlog.h                |   3 +-
 src/include/catalog/pg_control.h         |   3 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 10 files changed, 202 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index df97596ddf9..a8cda2fafbc 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -30,6 +30,8 @@ ProcessBarrierWALProhibit(void)
 	 */
 	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
 	{
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
 		/*
 		 * XXX: Kill off the whole session by throwing FATAL instead of killing
 		 * transaction by throwing ERROR due to following reasons that need be
@@ -64,6 +66,8 @@ ProcessBarrierWALProhibit(void)
 void
 AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
 {
+	uint32			state;
+
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
@@ -72,10 +76,22 @@ AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
 	/* Alter WAL prohibit state not allowed during recovery */
 	PreventCommandDuringRecovery("ALTER SYSTEM");
 
-	/* Yet to add ALTER SYTEM READ WRITE support */
-	if (!stmt->WALProhibited)
-		elog(ERROR, "XXX: Yet to implement");
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we have yet to convey this WAL prohibit state to all backends,
+	 * mark it as in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return; /* server is already in the desired state */
 
-	MakeReadOnlyXLOG();
-	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	WALProhibitRequest();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 83919bdb1f0..ded36113d1a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -661,10 +662,10 @@ typedef struct XLogCtlData
 	RecoveryState SharedRecoveryState;
 
 	/*
-	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * SharedWALProhibitState indicates current WAL prohibit state.
 	 * Protected by info_lck.
 	 */
-	bool		WALProhibited;
+	uint32		SharedWALProhibitState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -7964,14 +7965,70 @@ StartupXLOG(void)
 	 */
 	if (fast_promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
+
+	/*
+	 * Update the WAL prohibit state in shared memory; it decides whether
+	 * further WAL inserts are allowed.
+	 *
+	 * XXX: what to do if the previous recovery checkpoint was not OK?  It
+	 * might never happen, even if the system is later put back into
+	 * read-write mode.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = ControlFile->wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+	SpinLockRelease(&XLogCtl->info_lck);
+	ResetLocalXLogInsertAllowed(); /* return to "check" state */
 }
 
-void
-MakeReadOnlyXLOG(void)
+/* Atomically return the current server WAL prohibited state */
+uint32
+GetWALProhibitState(void)
+{
+	uint32		state;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	state = XLogCtl->SharedWALProhibitState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return state;
+}
+
+/*
+ * SetWALProhibitState: Change the current WAL prohibit state to the input state.
+ *
+ * If the server has already completely moved to the requested WAL prohibit
+ * state, or if the desired state is the same as the current state, return false,
+ * indicating that the server state did not change.  Otherwise return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
 {
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Update the new state in shared memory */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->WALProhibited = true;
+	XLogCtl->SharedWALProhibitState = new_state;
 	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Update control file if it is the final state */
+	if (!(new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		bool	wal_prohibited = (new_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->wal_prohibited = wal_prohibited;
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	return true;
 }
 
 /*
@@ -7980,9 +8037,7 @@ MakeReadOnlyXLOG(void)
 bool
 IsWALProhibited(void)
 {
-	volatile XLogCtlData *xlogctl = XLogCtl;
-
-	return xlogctl->WALProhibited;
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e5e56d4eec..1e80d53c18a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -127,6 +128,8 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
+	ConditionVariable readonly_cv; /* signaled when the WAL prohibit state change completes */
+
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
@@ -168,6 +171,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void performWALProhibitStateChange(uint32 wal_state);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -332,6 +336,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -342,18 +347,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
-		/*
-		 * If the server is in WAL-Prohibited state then don't do anything until
-		 * someone wakes us up. E.g. a backend might later on request us to put
-		 * the system back to read-write.
-		 */
-		if (IsWALProhibited())
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			performWALProhibitStateChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
 		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example a
+			 * backend might later on request us to put the system back into
+			 * the read-write WAL prohibit state.
+			 */
 			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
 							 WAIT_EVENT_CHECKPOINTER_MAIN);
 			continue;
 		}
 
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -891,6 +906,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->readonly_cv);
 	}
 }
 
@@ -1121,6 +1137,78 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	return true;
 }
 
+/*
+ * WALProhibitRequest: Request the checkpointer to perform the WAL prohibit
+ * state change, and wait until it completes.
+ */
+void
+WALProhibitRequest(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+
+	/* Wait for the state transition to complete */
+	ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
+	for (;;)
+	{
+		/* We'll be done once the in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+		ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
+							   WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+	elog(DEBUG1, "Done WALProhibitRequest");
+}
+
+/*
+ * performWALProhibitStateChange: checkpointer will call this to complete
+ * the requested WAL prohibit state transition.
+ */
+static void
+performWALProhibitStateChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/* Must be called from checkpointer */
+	Assert(AmCheckpointerProcess());
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "Checkpointer: waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+			ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&CheckpointerShmem->readonly_cv);
+}
+
 /*
  * CompactCheckpointerRequestQueue
  *		Remove duplicates from the request queue to avoid backend fsyncs.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e96134dac8a..b5d85d35938 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4054,6 +4054,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df744..9594df76946 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 619c33cd780..163fe0d2fce 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -18,4 +18,15 @@
 extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * This bit is used during the transition from one state to another.  When
+ * this bit is set, the state indicated by the 0th position bit is yet to be
+ * confirmed.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ca7ae766e3f..060bfa4acf3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -324,7 +324,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
-extern void MakeReadOnlyXLOG(void);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(uint32 new_state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e5382..b32c7723275 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether further WAL inserts are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481ca..4bd0193e035 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -954,6 +954,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..e8271b49f6d 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -35,6 +35,8 @@ extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
 
+extern void WALProhibitRequest(void);
+
 extern void AbsorbSyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
-- 
2.18.0

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: amul sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:

Hi,

Attached patch proposes $Subject feature which forces the system into read-only
mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
WRITE executed.

The high-level goal is to make the availability/scale-out situation better. The feature
will help HA setup where the master server needs to stop accepting WAL writes
immediately and kick out any transaction expecting WAL writes at the end, in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without needing to shut
down the master. You can instead make the master read-only, wait until the standby
catches up, and then promote the standby. The master remains available for read
queries throughout, and also for WAL streaming, but without the possibility of any
new write transactions. After switchover is complete, the master can be shut down
and brought back up as a standby without needing to use pg_rewind. (Eventually, it
would be nice to be able to make the read-only master into a standby without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect that the master
has lost network connectivity to the standby, you might make it read-only after 30 s,
and promote the standby after 60 s, so that you never have two writable masters at
the same time. In this case, there's still some split-brain, but it's still better than what
we have now.

Design:
----------
The proposed feature is built atop of super barrier mechanism commit[1] to coordinate
global state changes to all active backends. Backends which executed
ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
process to change the requested WAL read/write state aka WAL prohibited and WAL
permitted state respectively. When the checkpointer process sees the WAL prohibit
state change request, it emits a global barrier and waits until all backends that
participate in the ProcSignal absorbs it. Once it has done the WAL read/write state in
share memory and control file will be updated so that XLogInsertAllowed() returns
accordingly.

Do we prohibit the checkpointer from writing dirty pages and writing a
checkpoint record as well? If so, will the checkpointer process
write the current dirty pages and write a checkpoint record, or do we
skip that as well?

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.

What about prepared transactions?

They can't commit without writing WAL, and they
can't abort without writing WAL, either, so we must at least abort the transaction. We
don't necessarily need to kill the session, but it's hard to avoid in all cases because
(1) if there are subtransactions active, we need to force the top-level abort record to
be written immediately, but we can't really do that while keeping the subtransactions
on the transaction stack, and (2) if the session is idle, we also need the top-level abort
record to be written immediately, but can't send an error to the client until the next
command is issued without losing wire protocol synchronization. For now, we just use
FATAL to kill the session; maybe this can be improved in the future.

Open transactions that don't have an XID are not killed, but will get an ERROR if they
try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

To make that happen, the patch adds a new coding rule: a critical section that will write
WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
that inserting WAL here must be OK, either because we have an XID (we would have
been killed by a change to read-only if one had occurred) or for some other reason.

The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
SYSTEM READ WRITE update not only the shared memory state but also the control
file, so that changes survive a restart.

The transition between read-write and read-only is a pretty major transition, so we emit
a log message for each successful execution of an ALTER SYSTEM READ {ONLY | WRITE}
command. Also, we have added a new GUC system_is_read_only which returns "on"
when the system is in the WAL prohibited state or in recovery.

Another part of the patch that is quite uneasy and needs discussion is that when the
shutdown happens in the read-only state we skip the shutdown checkpoint; at restart,
startup recovery is performed first and later the read-only state is restored to
prohibit further WAL writes, irrespective of whether the recovery checkpoint succeeded
or not. The concern here is that if this startup recovery checkpoint wasn't ok, then it
will never happen, even if the system is later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#3tushar
tushar.ahuja@enterprisedb.com
In reply to: amul sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

On 6/16/20 7:25 PM, amul sul wrote:

Attached patch proposes $Subject feature which forces the system into
read-only
mode where insert write-ahead log will be prohibited until ALTER
SYSTEM READ
WRITE executed.

Thanks Amul.

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter  system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to  normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

so that the ALTER SYSTEM command syntax remains the same for all.

postgres=# \h alter system
Command:     ALTER SYSTEM
Description: change a server configuration parameter
Syntax:
ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' |
DEFAULT }

ALTER SYSTEM RESET configuration_parameter
ALTER SYSTEM RESET ALL

How are we going to justify this in the help command for ALTER SYSTEM?

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR:  checkpointer is not running

backend>

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

#4Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer from writing dirty pages and writing a
checkpoint record as well? If so, will the checkpointer process
write the current dirty pages and write a checkpoint record, or do we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.
But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.
(I'm not sure if this is how the patch currently works; I'm describing
how I think it should work.)
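
As a rough sketch of that intended behavior (t is just some hypothetical
table; the exact error reporting is part of what is being worked out here):

ALTER SYSTEM READ ONLY;

SELECT count(*) FROM t;    -- reads, and any buffer replacement they need, still work
INSERT INTO t VALUES (1);  -- expected to fail, since it would have to insert WAL

ALTER SYSTEM READ WRITE;   -- reverses the state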

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.

What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to
write either a commit record or an abort record. But a prepared
transaction doesn't have that problem. You can't COMMIT PREPARED or
ROLLBACK PREPARED while the system is read-only, as I suppose anybody
would expect, but their mere existence isn't a problem.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

Another part of the patch that is quite uneasy and needs discussion is that when the
shutdown happens in the read-only state we skip the shutdown checkpoint; at restart,
startup recovery is performed first and later the read-only state is restored to
prohibit further WAL writes, irrespective of whether the recovery checkpoint succeeded
or not. The concern here is that if this startup recovery checkpoint wasn't ok, then it
will never happen, even if the system is later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then. But I think right now the patch
performs the end-of-recovery checkpoint before restoring the read-only
state, which seems 100% wrong to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Robert Haas
robertmhaas@gmail.com
In reply to: tushar (#3)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I
think. It's a different kind of state change. It doesn't, and can't,
just edit postgresql.auto.conf.
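
To illustrate the difference between the two forms (ALTER SYSTEM READ ONLY
being the command proposed here):

ALTER SYSTEM SET work_mem = '64MB';   -- rewrites postgresql.auto.conf;
SELECT pg_reload_conf();              -- takes effect after a reload

ALTER SYSTEM READ ONLY;               -- proposed: changes shared-memory and
                                      -- control-file state directly, without
                                      -- touching postgresql.auto.conf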

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR: checkpointer is not running

backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#2)
Re: [Patch] ALTER SYSTEM READ ONLY

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:

Attached patch proposes $Subject feature which forces the system into read-only
mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
WRITE executed.

Do we prohibit the checkpointer from writing dirty pages and writing a
checkpoint record as well?

I think this is a really bad idea and should simply be rejected.

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page). It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance. It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first. Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on. I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

regards, tom lane

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page). It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance. It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

This seems like pretty dubious hand-waving. Of course, things that
write WAL are going to be broken by a switch that prevents writing
WAL; but if they were not, there would be no purpose in having such a
switch, so that's not really an argument. But you seem to have mixed
in some things that don't require writing WAL, and claimed without
evidence that those would somehow also be broken. I don't think that's
the case, but even if it were, so what? We live with all of these
restrictions on standbys anyway.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

I mean, the syntax that we use for a feature like this is arbitrary. I
picked this one, so I like it, but it can easily be changed if other
people want something else. The rest of this argument doesn't seem to
me to make very much sense. The existing ALTER SYSTEM functionality to
modify a text configuration file isn't replicated today and I'm not
sure why we should make it so, considering that replication generally
only considers things that are guaranteed to be the same on the master
and the standby, which this is not. But even if we did, that has
nothing to do with whether some functionality that changes the system
state without changing a text file ought to also be replicated. This
is a piece of cluster management functionality and it makes no sense
to replicate it. And no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.
That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.
The walsenders are supposed to send all the WAL from the master before
exiting, but if the connection is broken for some reason, then the
master is down and the standbys can't stream the rest of the WAL. You
can start it up again, but then you might generate more WAL. You can
try to copy the WAL around manually from one pg_wal directory to
another, but that's not a very nice thing for users to need to do
manually, and seems buggy and error-prone.

And how do you figure out where the WAL ends on the master and make
sure that the standby replayed it all? If the master is up, it's easy:
you just use the same queries you use all the time. If the master is
down, you have to use some different technique that involves manually
examining files or scrutinizing pg_controldata output. It's actually
very difficult to get this right.
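
With the primary still up in read-only mode, that check stays a couple of
ordinary monitoring queries, roughly like this (these functions and views
already exist; nothing here is added by the patch):

-- On the primary: where WAL generation stopped, and how far each standby
-- has replayed.
SELECT pg_current_wal_lsn() AS primary_lsn,
       application_name,
       replay_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the standby, before promoting it: confirm it has replayed up to the
-- primary's LSN.
SELECT pg_last_wal_replay_lsn();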

Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

Killing the walsenders does nothing ... the clients immediately reconnect.
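
For reference, "killing the walsenders" amounts to something like the
following, and the walreceivers simply reconnect a moment later, so nothing
stops new WAL from being generated and streamed:

SELECT pg_terminate_backend(pid) FROM pg_stat_replication;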

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

I do think this is complex and the risk of bugs that are hard to
identify or hard to fix certainly needs to be considered. I
strenuously disagree with the idea that there is not very much real
usefulness. Getting failover set up in a way that actually works
robustly is, in my experience, one of the two or three most serious
challenges my employer's customers face today. The core server support
we provide for that is breathtakingly primitive, and it's urgent that
we do better. Cloud providers are moving users from PostgreSQL to
their own forks of PostgreSQL in vast numbers in large part because
users don't want to deal with this crap, and the cloud providers have
made it so they don't have to. People running PostgreSQL themselves
need complex third-party tools and even then the experience isn't as
good as what a major cloud provider would offer. This patch is not
going to fix that, but I think it's a step in the right direction, and
I hope others will agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

Robert Haas <robertmhaas@gmail.com> writes:

This seems like pretty dubious hand-waving. Of course, things that
write WAL are going to be broken by a switch that prevents writing
WAL; but if they were not, there would be no purpose in having such a
switch, so that's not really an argument. But you seem to have mixed
in some things that don't require writing WAL, and claimed without
evidence that those would somehow also be broken.

Which of the things I mentioned don't require writing WAL?

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems.

... no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

My point was that putting this in ALTER SYSTEM paints us into a corner
as to what we can do with ALTER SYSTEM in the future: we won't ever be
able to make that do anything that would require writing WAL. And I
don't entirely believe your argument that that will never be something
we'd want to do.

regards, tom lane

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Which of the things I mentioned don't require writing WAL?

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.
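
Whether that applies on a given cluster is easy to check; wal_log_hints has
the same effect as checksums here, in that hint-bit changes can then require
a full-page image the first time a page is modified after a checkpoint:

SHOW data_checksums;
SHOW wal_log_hints;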

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems.

... no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

My point was that putting this in ALTER SYSTEM paints us into a corner
as to what we can do with ALTER SYSTEM in the future: we won't ever be
able to make that do anything that would require writing WAL. And I
don't entirely believe your argument that that will never be something
we'd want to do.

I think that depends a lot on how you view ALTER SYSTEM. I believe it
would be reasonable to view ALTER SYSTEM as a catch-all for commands
that make system-wide state changes, even if those changes are not all
of the same kind as each other; some might be machine-local, and
others cluster-wide; some WAL-logged, and others not. I don't think
it's smart to view ALTER SYSTEM through a lens that boxes it into only
editing postgresql.auto.conf; if that were so, we ought to have called
it ALTER CONFIGURATION FILE or something rather than ALTER SYSTEM. For
that reason, I do not see the choice of syntax as painting us into a
corner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: [Patch] ALTER SYSTEM READ ONLY

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Which of the things I mentioned don't require writing WAL?

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

And your point is? I thought enabling checksums was considered
good practice these days.

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

Well, if pg_demote can be done for X amount of effort, and largely
gets the job done, while this requires 10X or 100X the effort and
introduces 10X or 100X as many bugs, I'm not especially convinced
that we want both.

regards, tom lane

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

And your point is? I thought enabling checksums was considered
good practice these days.

I don't want to have an argument about what typical or best practices
are; I wasn't trying to make any point about that one way or the
other. I'm just saying that the operations you listed don't
necessarily all write WAL. In any event, even if they did, the larger
point is that standbys work like that, too, so it's not unprecedented
or illogical to think of such things.

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

Well, if pg_demote can be done for X amount of effort, and largely
gets the job done, while this requires 10X or 100X the effort and
introduces 10X or 100X as many bugs, I'm not especially convinced
that we want both.

Sure: if two features duplicate each other, and one of them is way
more work and way more buggy, then it's silly to have both, and we
should just accept the easy, bug-free one. However, as I said in the
other email to which I referred you, I currently believe that these
two features actually don't duplicate each other and that using them
both together would be quite beneficial. Also, even if they did, I
don't know where you are getting the idea that this feature will be
10X or 100X more work and more buggy than the other one. I have looked
at this code prior to it being posted, but I haven't looked at the
other code at all; I am guessing that you have looked at neither. I
would be happy if you did, because it is often the case that
architectural issues that escape other people are apparent to you upon
examination, and it's always nice to know about those earlier rather
than later so that one can decide to (a) give up or (b) fix them. But
I see no point in speculating in the abstract that such issues may
exist and that they may be more severe in one case than the other. My
own guess is that, properly implemented, they are within 2-3X of each
in one direction or the other, not 10-100X. It is almost unbelievable
to me that the pg_demote patch could be 100X simpler than this one; if
it were, I'd be practically certain it was a 5-minute hack job
unworthy of any serious consideration.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On 2020-06-17 12:07:22 -0400, Robert Haas wrote:

On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

I mean, the syntax that we use for a feature like this is arbitrary. I
picked this one, so I like it, but it can easily be changed if other
people want something else. The rest of this argument doesn't seem to
me to make very much sense. The existing ALTER SYSTEM functionality to
modify a text configuration file isn't replicated today and I'm not
sure why we should make it so, considering that replication generally
only considers things that are guaranteed to be the same on the master
and the standby, which this is not. But even if we did, that has
nothing to do with whether some functionality that changes the system
state without changing a text file ought to also be replicated. This
is a piece of cluster management functionality and it makes no sense
to replicate it. And no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

I agree that the concrete syntax here doesn't seem to matter much. If
this worked by actually putting a GUC into the config file, it would
perhaps matter a bit more, but it doesn't afaict. It seems good to
avoid new top-level statements, and ALTER SYSTEM seems to fit well.

I wonder if there's an argument about wanting to be able to execute this
command over a physical replication connection? I think this feature
fairly obviously is a building block for "gracefully failover to this
standby", and it seems like it'd be nicer if that didn't potentially
require two pg_hba.conf entries for the to-be-promoted primary on the
current/old primary?

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.
That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.
The walsenders are supposed to send all the WAL from the master before
exiting, but if the connection is broken for some reason, then the
master is down and the standbys can't stream the rest of the WAL. You
can start it up again, but then you might generate more WAL. You can
try to copy the WAL around manually from one pg_wal directory to
another, but that's not a very nice thing for users to need to do
manually, and seems buggy and error-prone.

Also (I'm sure you're aware) if you just non-gracefully shut down the
old primary, you're going to have to rewind the old primary to be able
to use it as a standby. And if you non-gracefully stop you're gonna
incur checkpoint overhead, which is *massive* on non-toy
databases. There's a huge practical difference between a minor version
upgrade causing 10s of unavailability and causing 5min-30min.

And how do you figure out where the WAL ends on the master and make
sure that the standby replayed it all? If the master is up, it's easy:
you just use the same queries you use all the time. If the master is
down, you have to use some different technique that involves manually
examining files or scrutinizing pg_controldata output. It's actually
very difficult to get this right.

Yea, it's absurdly hard. I think it's really kind of ridiculous that we
expect others to get this right if we, the developers of this stuff,
can't really get it right because it's so complicated. Which imo makes
this:

Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.

more of an argument for having this type of stuff builtin.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

Yup.

Greetings,

Andres Freund

#13amul sul
sulamul@gmail.com
In reply to: Robert Haas (#4)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer from writing dirty pages and writing a
checkpoint record as well? If so, will the checkpointer process
write the current dirty pages and write a checkpoint record, or do we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.
But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.
(I'm not sure if this is how the patch currently works; I'm describing
how I think it should work.)

You are correct -- writing dirty pages is not restricted.

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.

What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to
write either a commit record or an abort record. But a prepared
transaction doesn't have that problem. You can't COMMIT PREPARED or
ROLLBACK PREPARED while the system is read-only, as I suppose anybody
would expect, but their mere existence isn't a problem.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

Another part of the patch that is quite uneasy and needs discussion is that when the
shutdown happens in the read-only state we skip the shutdown checkpoint; at restart,
startup recovery is performed first and later the read-only state is restored to
prohibit further WAL writes, irrespective of whether the recovery checkpoint succeeded
or not. The concern here is that if this startup recovery checkpoint wasn't ok, then it
will never happen, even if the system is later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then. But I think right now the patch
performs the end-of-recovery checkpoint before restoring the read-only
state, which seems 100% wrong to me.

Yeah, we need more thought on how to proceed further. I kind of agree with Robert
that the current behavior is not right, since writing the end-of-recovery
checkpoint violates the no-WAL-write rule.

Regards,
Amul

#14amul sul
sulamul@gmail.com
In reply to: Robert Haas (#5)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I
think. It's a different kind of state change. It doesn't, and can't,
just edit postgresql.auto.conf.

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR: checkpointer is not running

backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

Ok, will try to enable changing between R/O and R/W in the next version.

Thanks Tushar for the testing.

Regards,
Amul

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#4)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer from writing dirty pages and writing a
checkpoint record as well? If so, will the checkpointer process
write the current dirty pages and write a checkpoint record, or do we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.

For buffer replacement, many a time we also have to perform
XLogFlush; what do we do for that? We can't proceed without doing
that, and erroring out from there means stopping a read-only query
from the user's perspective.

But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.

What is the harm in doing the checkpoint before we put the system into
READ ONLY state? The advantage is that we can at least reduce the
recovery time if we allow writing checkpoint record.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

If the truncate operation errors out, then won't the system again
trigger a new autovacuum worker for the same relation, since we update
stats at the end? Also, in general for regular tables, if there is an
error while it tries to write WAL, it could again trigger the autovacuum
worker for the same relation. If this is true, then it will unnecessarily
generate a lot of dirty pages, and I don't think it would be good for
the system to behave that way.

Another part of the patch that is quite uneasy and needs discussion is that when the
shutdown happens in the read-only state we skip the shutdown checkpoint; at restart,
startup recovery is performed first and later the read-only state is restored to
prohibit further WAL writes, irrespective of whether the recovery checkpoint succeeded
or not. The concern here is that if this startup recovery checkpoint wasn't ok, then it
will never happen, even if the system is later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then.

But then if we have to perform recovery again, it will start from the
previous checkpoint. I think we have to live with it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, 17 Jun 2020 12:07:22 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
[...]

Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

Yes. Maybe we should make sure the wording we are using is the same for
everyone. I already hear/read "failover", "controlled failover", "switchover" or
"controlled switchover", this is confusing. My definition of switchover is:

swapping primary and secondary status between two replicating instances. With
no data loss. This is a controlled procedure where all steps must succeed to
complete.
If a step fails, the procedure falls back to the original primary with no data
loss.

However, Wikipedia has a broader definition, including situations where the
switchover is executed upon a failure: https://en.wikipedia.org/wiki/Switchover

Regards,

#17Simon Riggs
simon@2ndquadrant.com
In reply to: amul sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, 16 Jun 2020 at 14:56, amul sul <sulamul@gmail.com> wrote:

The high-level goal is to make the availability/scale-out situation
better. The feature
will help HA setup where the master server needs to stop accepting WAL
writes
immediately and kick out any transaction expecting WAL writes at the end,
in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without
needing to shut
down the master. You can instead make the master read-only, wait until the
standby
catches up, and then promote the standby. The master remains available for
read
queries throughout, and also for WAL streaming, but without the
possibility of any
new write transactions. After switchover is complete, the master can be
shut down
and brought back up as a standby without needing to use pg_rewind.
(Eventually, it
would be nice to be able to make the read-only master into a standby
without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect
that the master
has lost network connectivity to the standby, you might make it read-only
after 30 s,
and promote the standby after 60 s, so that you never have two writable
masters at
the same time. In this case, there's still some split-brain, but it's
still better than what
we have now.

If there are open transactions that have acquired an XID, the sessions are
killed
before the barrier is absorbed.

inbuilt graceful failover for PostgreSQL

That doesn't appear to be very graceful. Perhaps objections could be
assuaged by having a smoother transition and perhaps not even a full
barrier, initially.

--
Simon Riggs    http://www.2ndQuadrant.com/
Mission Critical Databases

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 9:37 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.

Yeah, but if there is a synchronous standby (a standby that provides sync
replication), the user can always route the connections to it
(automatically, if there is some middleware which can detect and route
the connection to the standby).

That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.

Fair enough. However, it is not described in the initial email
(unless I have missed it; there is a mention that this patch is one
part of that bigger feature, but no further explanation of that bigger
feature) how this feature will allow a clean switchover. I think
before we put the system into READ ONLY state, there could be some WAL
which we haven't yet sent to the standby; what do we do about that?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19amul sul
sulamul@gmail.com
In reply to: Amit Kapila (#15)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.

For buffer replacement, many times we also have to perform
XLogFlush; what do we do for that? We can't proceed without doing
it, and erroring out from there means stopping a read-only query from
the user's perspective.

Read-only does not restrict XLogFlush().
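
To see why that is safe, here is a minimal sketch of the buffer-eviction path,
loosely modelled on FlushBuffer(); the function below is illustrative only, not
code from the patch:

#include "postgres.h"

#include "access/xlog.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Illustrative only: evicting a dirty buffer needs to flush WAL that already
 * exists up to the page's LSN, but it never inserts new WAL, so it keeps
 * working after the system has been made read only.
 */
static void
flush_buffer_for_eviction(Buffer buf)
{
	Page		page = BufferGetPage(buf);
	XLogRecPtr	lsn = PageGetLSN(page);

	/* Once all pre-existing WAL is on disk, this becomes a no-op. */
	XLogFlush(lsn);

	/* ... after which the dirty data page itself can be written out. */
}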

But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.

What is the harm in doing the checkpoint before we put the system into
READ ONLY state? The advantage is that we can at least reduce the
recovery time if we allow writing a checkpoint record.

The checkpoint could take a long time, which works against the intent of
switching to the read-only state quickly.

What if vacuum is running on an unlogged relation? Do we allow writes via
vacuum to an unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

If the truncate operation errors out, then won't the system again
trigger a new autovacuum worker for the same relation, since we update
stats at the end? Also, in general for regular tables, if there is an
error while it tries to write WAL, it could again trigger an autovacuum
worker for the same relation. If this is true, then it will unnecessarily
generate a lot of dirty pages, and I don't think it would be good for
the system to behave that way.

No new autovacuum worker will be forked in the read-only state, and existing
workers will get an error if they try to write WAL after barrier absorption.
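
For what it's worth, the intended call-site pattern looks roughly like this
(illustrative only; CheckWALPermitted() is the helper added by the proposed
patches, while the toy function around it is made up for the example):

#include "postgres.h"

#include "access/walprohibit.h"		/* CheckWALPermitted(), from the patch set */
#include "access/xlog_internal.h"	/* XLOG_FPI */
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Illustrative pattern only; the real call sites are in the v2-0005 patch.
 * An operation that may write WAL without holding an XID (e.g. vacuum)
 * checks WAL permission *before* the critical section, so a WAL-prohibited
 * system raises a plain ERROR here rather than a PANIC inside it.
 */
static void
example_log_page_change(Relation rel, Buffer buf)
{
	if (RelationNeedsWAL(rel))
		CheckWALPermitted();	/* ERRORs out while the system is read only */

	START_CRIT_SECTION();

	MarkBufferDirty(buf);

	if (RelationNeedsWAL(rel))
	{
		XLogBeginInsert();
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
		(void) XLogInsert(RM_XLOG_ID, XLOG_FPI);	/* record type chosen only for illustration */
	}

	END_CRIT_SECTION();
}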

Another part of the patch that I am uneasy about, and which needs discussion, is
that when the server is shut down in the read-only state we skip the shutdown
checkpoint; at restart, startup recovery is performed first, and then the read-only
state is restored to prohibit further WAL writes, irrespective of whether the
recovery checkpoint succeeded or not. The concern here is that if this startup
recovery checkpoint wasn't OK, it will never happen, even if the system is later
put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not"? Do you add a try..catch and skip
any error while performing the recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then.

But then if we have to perform recovery again, it will start from the
previous checkpoint. I think we have to live with it.

Let me explain the case: if we skip the end-of-recovery checkpoint while
starting the system in read-only mode, and then later change the state to
read-write and do a few write operations and online checkpoints, will that be
fine? I have yet to explore those things.

Regards,
Amul

#20Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#15)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For buffer replacement, many times we also have to perform
XLogFlush; what do we do for that? We can't proceed without doing
it, and erroring out from there means stopping a read-only query from
the user's perspective.

I think we should stop WAL writes, then XLogFlush() once, then declare
the system R/O. After that there might be more XLogFlush() calls but
there won't be any new WAL, so they won't do anything.
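
A rough sketch of that ordering, written in terms of the helpers that appear in
the attached v2 patches (the wrapper function itself is illustrative, not the
actual patch code):

#include "postgres.h"

#include "access/walprohibit.h"		/* MakeReadOnlyXLOG() etc., from the patch set */
#include "access/xlog.h"
#include "storage/procsignal.h"

/*
 * Illustrative wrapper only: stop new WAL inserts, flush what already
 * exists once, and from then on the system is effectively read only.
 */
static void
make_system_read_only(void)
{
	/* 1. Stop new WAL inserts by flipping the shared flag ... */
	MakeReadOnlyXLOG();

	/* ... and wait until every backend has absorbed the barrier. */
	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));

	/* 2. Flush the WAL that already exists, exactly once. */
	XLogFlush(GetXLogInsertRecPtr());

	/*
	 * 3. From here on XLogInsertAllowed() reports false, so any later
	 * XLogFlush() calls find nothing new to write.
	 */
}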

But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.

What is the harm in doing the checkpoint before we put the system into
READ ONLY state? The advantage is that we can at least reduce the
recovery time if we allow writing a checkpoint record.

Well, as Andres says in
/messages/by-id/20200617180546.yucxtiupvxghxss6@alap3.anarazel.de
it can take a really long time.

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

If the truncate operation errors out, then won't the system again
trigger a new autovacuum worker for the same relation, since we update
stats at the end?

Not if we do what I said in that paragraph. If we're not launching new
workers we can't again trigger a worker for the same relation.

Also, in general for regular tables, if there is an
error while it tries to write WAL, it could again trigger an autovacuum
worker for the same relation. If this is true, then it will unnecessarily
generate a lot of dirty pages, and I don't think it would be good for
the system to behave that way.

I don't see how this would happen. VACUUM can't really dirty pages
without writing WAL, can it? And, anyway, if there's an error, we're
not going to try again for the same relation unless we launch new
workers.

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then.

But then if we have to perform recovery again, it will start from the
previous checkpoint. I think we have to live with it.

Yeah. I don't think it's that bad. The case where you shut down the
system while it's read-only should be a somewhat unusual one. Normally
you would mark it read only and then promote a standby and shut the
old master down (or demote it). But what you want is that if it does
happen to go down for some reason before all the WAL is streamed, you
can bring it back up and finish streaming the WAL without generating
any new WAL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Robert Haas
robertmhaas@gmail.com
In reply to: amul sul (#19)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 7:19 AM amul sul <sulamul@gmail.com> wrote:

Let me explain the case: if we skip the end-of-recovery checkpoint while
starting the system in read-only mode, and then later change the state to
read-write and do a few write operations and online checkpoints, will that be
fine? I have yet to explore those things.

I think we'd want the FIRST write operation to be the end-of-recovery
checkpoint, before the system is fully read-write. And then after that
completes you could do other things.
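
Something along these lines, purely as a sketch; RequestCheckpoint() and its
flags already exist, while ClearWALProhibitState() is just a stand-in name for
whatever helper a later patch version provides:

#include "postgres.h"

#include "access/xlog.h"
#include "postmaster/bgwriter.h"	/* RequestCheckpoint() */

/* Stand-in declaration; not part of the attached patches. */
extern void ClearWALProhibitState(void);

/*
 * Illustrative only: when going back to read-write after a skipped
 * end-of-recovery checkpoint, make that checkpoint the first WAL written
 * and wait for it before treating the system as fully read-write.
 */
static void
make_system_read_write(void)
{
	ClearWALProhibitState();	/* hypothetical: allow WAL inserts again */

	RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
}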

It would be good if we can get an opinion from Andres about this,
since I think he has thought about this stuff quite a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#17)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 6:39 AM Simon Riggs <simon@2ndquadrant.com> wrote:

That doesn't appear to be very graceful. Perhaps objections could be assuaged by having a smoother transition and perhaps not even a full barrier, initially.

Yeah, it's not ideal, though still better than what we have now. What
do you mean by "a smoother transition and perhaps not even a full
barrier"? I think if you want to switch the primary to another machine
and make the old primary into a standby, you really need to arrest WAL
writes completely. It would be better to make existing write
transactions ERROR rather than FATAL, but there are some very
difficult cases there, so I would like to leave that as a possible
later improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Robert Haas (#20)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, 18 Jun 2020 10:52:49 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

[...]

But what you want is that if it does happen to go down for some reason before
all the WAL is streamed, you can bring it back up and finish streaming the
WAL without generating any new WAL.

Thanks to cascading replication, that could very well be possible without this READ
ONLY mode, just in recovery mode, couldn't it?

Regards,

#24Robert Haas
robertmhaas@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#23)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 11:08 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

Thanks to cascading replication, that could very well be possible without this READ
ONLY mode, just in recovery mode, couldn't it?

Yeah, perhaps. I just wrote an email about that over on the demote
thread, so I won't repeat it here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25amul sul
sulamul@gmail.com
In reply to: Robert Haas (#20)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 8:23 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For buffer replacement, many times we also have to perform
XLogFlush; what do we do for that? We can't proceed without doing
it, and erroring out from there means stopping a read-only query from
the user's perspective.

I think we should stop WAL writes, then XLogFlush() once, then declare
the system R/O. After that there might be more XLogFlush() calls but
there won't be any new WAL, so they won't do anything.

Yeah, the proposed v1 patch does the same.

Regards,
Amul

#26Amul Sul
sulamul@gmail.com
In reply to: amul sul (#25)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi All,

Attaching a new set of patches, rebased atop the latest master head, which
include the following changes:

1. Enabling ALTER SYSTEM READ { ONLY | WRITE } support for single-user mode,
discussed here [1]

2. Now skipping the startup checkpoint if the system is in read-only mode, as
discussed [2].

3. While changing the system state to READ-WRITE, a new checkpoint request will
be made.

All these changes are part of the v2-0004 patch; the rest of the patches are
the same as in v1.

Regards,
Amul

1] /messages/by-id/CAAJ_b96WPPt-=vyjpPUy8pG0vAvLgpjLukCZONUkvdR1_exrKA@mail.gmail.com
2] /messages/by-id/CAAJ_b95hddJrgciCfri2NkTLdEUSz6zdMSjoDuWPFPBFvJy+Kg@mail.gmail.com

Attachments:

v2-0006-Documentation-WIP.patch (application/x-patch)
From b7bc7cb9c417cdb7a1607237642ac2f61fe4ba74 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 2 Jun 2020 00:45:20 -0400
Subject: [PATCH v2 6/6] Documentation - WIP

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 59 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 13 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..f62929f1660 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -433,8 +433,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -477,6 +477,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by ALTER SYSTEM READ
+ONLY.  We have a lower-level defense in XLogBeginInsert() and elsewhere that
+stops us from modifying data when !XLogInsertAllowed(), but if XLogBeginInsert()
+is called inside a critical section we must not depend on it to report an
+error, because that would cause a PANIC as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend which receives the read-only state transition barrier interrupt
+needs to stop writing WAL immediately.  While absorbing the barrier, the
+backend(s) will kill any running transaction that has a valid XID, since a
+valid XID indicates that the transaction has performed, or is planning, a WAL
+write.  Transactions which have not yet acquired a valid XID, or operations
+such as VACUUM or CREATE INDEX CONCURRENTLY which do not necessarily have a
+valid XID when writing WAL, are not stopped during barrier processing, and
+those might hit the error from XLogBeginInsert() when trying to write WAL in
+the read only system state.  To prevent such an error from inside a critical
+section, WAL write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assert-only flag that indicates
+whether permission has been checked before calling XLogBeginInsert().  If it
+has not, XLogBeginInsert() will fail an assertion.  The WAL permission check
+is not mandatory if XLogBeginInsert() is not inside a critical section, where
+throwing an error is acceptable.  To get the permission check flag set, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for placing these permission
+check routines are:
+
+	Places where a WAL write inside a critical section can be expected without
+	having a valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(),
+	so that the error can be reported before entering the critical section.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked using AssertWALPermitted_HaveXID(), so that a
+	non-assert build does not have the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and which may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -522,7 +570,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -630,8 +679,8 @@ If the buffer is clean and checksums are in use then
 MarkBufferDirtyHint() inserts an XLOG_FPI record to ensure that we
 take a full page image that includes the hint. We do this to avoid
 a partial page write, when we write the dirtied page. WAL is not
-written during recovery, so we simply skip dirtying blocks because
-of hints when in recovery.
+written while in read only (i.e. during recovery or in WAL prohibit state), so
+we simply skip dirtying blocks because of hints when in read only.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index 4e45bd92abc..e5a32e53649 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,10 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the master.
+New WAL records cannot be written during recovery or while in WAL prohibit
+state, so hint bits set during read only system state must not dirty the page if
+the buffer is not already dirty, when checksums are enabled.  Systems in
+Hot-Standby mode may benefit from hint bits being set, but with checksums
+enabled, a page cannot be dirtied after setting a hint bit (due to the torn page
+risk). So, it must wait for full-page images containing the hint bit updates to
+arrive from the master.
-- 
2.18.0

v2-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch (application/x-patch)
From bf257bcfd0a5c404209521249991b63dcc79a712 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v2 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() will emit
    the PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until
    the barrier has been absorbed by all the backends.

 2. When a backend receives the WAL-Prohibited barrier, if at that moment it is
    already in a transaction and the transaction has already been assigned an
    XID, then the backend will be killed by throwing FATAL (XXX: needs more
    discussion on this)

 3. Otherwise, if that backend is running a transaction which has not yet been
    assigned an XID, we don't need to do anything special; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (from an existing or a new backend) starts as a read-only
    transaction.

 5. Auxiliary processes like the autovacuum launcher, background writer,
    checkpointer and walwriter will not do anything in the WAL-Prohibited
    server state until someone wakes them up, e.g. a backend might later
    request us to put the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation. Starting up again will perform crash recovery (XXX:
    needs some discussion on this as well)

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. Only a superuser can toggle the WAL-Prohibit state.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |  1 +
 src/backend/access/transam/walprohibit.c | 81 ++++++++++++++++++++++++
 src/backend/access/transam/xact.c        | 49 ++++++++------
 src/backend/access/transam/xlog.c        | 72 ++++++++++++++++++---
 src/backend/postmaster/autovacuum.c      |  4 ++
 src/backend/postmaster/bgwriter.c        |  2 +-
 src/backend/postmaster/checkpointer.c    | 12 ++++
 src/backend/storage/ipc/procsignal.c     | 26 ++------
 src/backend/tcop/utility.c               | 14 +---
 src/backend/utils/misc/guc.c             | 26 ++++++++
 src/include/access/walprohibit.h         | 21 ++++++
 src/include/access/xlog.h                |  3 +
 src/include/storage/procsignal.h         |  7 +-
 13 files changed, 246 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..df97596ddf9
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "postmaster/bgwriter.h"
+#include "storage/procsignal.h"
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * transaction by throwing ERROR due to following reasons that need be
+		 * thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in subtransaction then the ERROR will kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Cannot continue a transaction if it has performed writes while system is read only.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Yet to add ALTER SYSTEM READ WRITE support */
+	if (!stmt->WALProhibited)
+		elog(ERROR, "XXX: Yet to implement");
+
+	MakeReadOnlyXLOG();
+	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 905dc7d8d3b..bf699f19fe8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1937,23 +1937,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
@@ -4875,9 +4880,11 @@ CommitSubTransaction(void)
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
 	 * upper is read-write while the child is read-only; GUC will incorrectly
-	 * think it should leave the child state in place.
+	 * think it should leave the child state in place.  Note that the upper
+	 * transaction will be forced to read-only irrespective of its previous
+	 * status if the server state is WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
@@ -5033,9 +5040,11 @@ AbortSubTransaction(void)
 	/*
 	 * Restore the upper transaction's read-only state, too.  This should be
 	 * redundant with GUC's cleanup but we may as well do it for consistency
-	 * with the commit case.
+	 * with the commit case.  Note that the upper transaction will be forced
+	 * to read-only irrespective of its previous status if the server state is
+	 * WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	RESUME_INTERRUPTS();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b6..ffa25b9d9b2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -245,9 +245,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -659,6 +660,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * Protected by info_lck.
+	 */
+	bool		WALProhibited;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -7959,6 +7966,25 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+void
+MakeReadOnlyXLOG(void)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->WALProhibited = true;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->WALProhibited;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8174,9 +8200,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8190,14 +8216,25 @@ XLogInsertAllowed(void)
 		return (bool) LocalXLogInsertAllowed;
 
 	/*
-	 * Else, must check to see if we're still in recovery.
+	 * Else, must check to see if we're still in recovery
 	 */
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8213,12 +8250,20 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
+	Assert(!IsWALProhibited());
+
 	LocalXLogInsertAllowed = 1;
 
 	/* Initialize as RecoveryInProgress() would do when switching state */
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8509,7 +8554,10 @@ ShutdownXLOG(int code, Datum arg)
 
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	/*
+	 * Can't perform checkpoint or xlog rotation without writing WAL.
+	 */
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8522,6 +8570,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..f83f86994db 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427f..6c6ff7dc3af 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -268,7 +268,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b80..5e5e56d4eec 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -342,6 +342,18 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		/*
+		 * If the server is in WAL-Prohibited state then don't do anything until
+		 * someone wakes us up. E.g. a backend might later on request us to put
+		 * the system back to read-write.
+		 */
+		if (IsWALProhibited())
+		{
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 13648887187..b973727a580 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 900088a2209..2767cf18c68 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3644,15 +3644,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	/* some code */
-	elog(INFO, "AlterSystemSetWALProhibitState() called");
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 75fc6f11d6a..58b56eac21f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -221,6 +221,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -610,6 +611,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2041,6 +2043,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -11998,4 +12012,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+ static const char *
+ show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..619c33cd780
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,21 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+
+#endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57cf..6e98e27b2c8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -298,11 +298,13 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool IsWALProhibited(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -322,6 +324,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void MakeReadOnlyXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
-- 
2.18.0

v2-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/x-patch)
From f301f644428afed7c886f5bf13451829b3e13144 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v2 1/6] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ece..13648887187 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.18.0

v2-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/x-patch)
From 15ec36c9fd9659fe8a497ef59719b5a87ca8014b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 16 Jun 2020 06:35:41 -0400
Subject: [PATCH v2 5/6] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, add an Assert or an ERROR when the system is
WAL prohibited:

 - Added an ERROR for functions which can be reached without a valid XID, e.g.
   in the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the
   common static inline function CheckWALPermitted().
 - Added an Assert for functions which cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that, added
   AssertWALPermitted_HaveXID().

To enforce the rule of having the aforesaid assert or error check before
entering a critical section for a WAL write, a new assert-only flag,
walpermit_checked_state, is added.  If this check is missing, XLogBeginInsert()
will fail an assertion when called inside a critical section.

If we are not doing the WAL insert inside a critical section, then the above
checking is not necessary; we can rely on XLogBeginInsert() for that check and
report an error.
---
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 16 ++++++++
 src/backend/access/brin/brin_revmap.c     |  8 ++++
 src/backend/access/gin/ginbtree.c         | 17 ++++++--
 src/backend/access/gin/gindatapage.c      | 14 ++++++-
 src/backend/access/gin/ginfast.c          |  8 ++++
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          | 10 ++++-
 src/backend/access/gin/ginvacuum.c        |  9 +++++
 src/backend/access/gist/gist.c            | 16 ++++++++
 src/backend/access/gist/gistvacuum.c      |  9 +++++
 src/backend/access/hash/hash.c            | 13 ++++++
 src/backend/access/hash/hashinsert.c      |  8 ++++
 src/backend/access/hash/hashovfl.c        | 14 +++++++
 src/backend/access/hash/hashpage.c        | 13 ++++++
 src/backend/access/heap/heapam.c          | 32 +++++++++++++++
 src/backend/access/heap/pruneheap.c       |  7 +++-
 src/backend/access/heap/vacuumlazy.c      | 13 ++++++
 src/backend/access/heap/visibilitymap.c   | 20 ++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  4 ++
 src/backend/access/nbtree/nbtinsert.c     | 13 +++++-
 src/backend/access/nbtree/nbtpage.c       | 24 +++++++++++
 src/backend/access/spgist/spgdoinsert.c   | 19 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 13 ++++++
 src/backend/access/transam/multixact.c    |  6 ++-
 src/backend/access/transam/twophase.c     | 10 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  7 ++++
 src/backend/access/transam/xlog.c         | 27 +++++++++----
 src/backend/access/transam/xloginsert.c   | 13 +++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 +++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 11 ++++-
 src/backend/storage/lmgr/lock.c           |  6 +--
 src/backend/utils/cache/relmapper.c       |  4 ++
 src/include/access/walprohibit.h          | 49 ++++++++++++++++++++++-
 src/include/miscadmin.h                   | 27 +++++++++++++
 40 files changed, 490 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5ee0c..ef002a51773 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -758,6 +759,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..197e1213137 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
@@ -240,6 +245,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(idxrel))
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -881,6 +894,9 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 9c4b3e22021..80b6e826ae7 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -405,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 				(errmsg("leftover placeholder tuple detected in BRIN index \"%s\", deleting",
 						RelationGetRelationName(idxrel))));
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(idxrel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -614,6 +619,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8d08b05f515..1b835b3000b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -333,6 +334,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -378,6 +380,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -386,10 +389,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -410,7 +417,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -548,6 +555,10 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -588,7 +599,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..226cb3ce44b 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -836,7 +837,11 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 		}
 
 		if (RelationNeedsWAL(indexrel))
+		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -1777,6 +1782,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1831,18 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..d7781de7674 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,9 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -587,7 +591,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * critical section.
 		 */
 		if (RelationNeedsWAL(index))
+		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..d957aa6e582 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1fedbc..938089238da 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,19 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..36a884af597 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -159,6 +160,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(gvs->index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -650,6 +655,10 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 79fe6eb8d62..8f6b15d8ee4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -134,6 +135,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -467,6 +471,10 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (!is_build && RelationNeedsWAL(rel))
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -525,6 +533,10 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -1665,6 +1677,10 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 
 	if (ndeletable > 0)
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..ccf9bc0c214 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -341,6 +342,10 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(rel))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -634,6 +639,10 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(info->index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3ec6d528e77..1d3f4c92f19 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -572,6 +573,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -787,6 +792,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(rel))
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -882,6 +891,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..360e30456fe 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,9 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -370,6 +374,10 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..5abba14899e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,9 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -577,6 +581,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	if (RelationNeedsWAL(rel))
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -929,7 +937,13 @@ readpage:
 					 * WAL for that.
 					 */
 					if (RelationNeedsWAL(rel))
+					{
+						/*
+						 * Can reach here from VACUUM, so need not have an XID
+						 */
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..faad58297d2 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,9 @@ restart_expand:
 		goto fail;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1176,9 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+					AssertWALPermitted_HaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1230,9 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+			AssertWALPermitted_HaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1279,9 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d1bb3..c52200463a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -46,6 +46,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1870,6 +1871,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2143,6 +2147,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2661,6 +2668,9 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3413,6 +3423,9 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3586,6 +3599,9 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4519,6 +4535,9 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5310,6 +5329,9 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5468,6 +5490,9 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5576,6 +5601,9 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5722,6 +5750,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(relation))
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd8d9a..e7fcfb02864 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -81,7 +82,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * clean the page. The master will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -225,6 +226,10 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 									 &prstate);
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(relation))
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3bef0e124ba..3613b7a88d6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -1195,6 +1196,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				/* Can reach here from VACUUM, so need not have an XID */
+				if (RelationNeedsWAL(onerel))
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1463,6 +1468,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(onerel))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1914,6 +1923,10 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, NULL);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(onerel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40d..30d1d6f34c7 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -270,6 +271,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from the startup process, so we need not
+	 * have an XID.
+	 *
+	 * Recovery in the startup process is never in WAL prohibit state, so skip
+	 * the permission check if we reach here in the startup process.
+	 */
+	if (RelationNeedsWAL(rel))
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -476,6 +487,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +501,14 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..a471a4b7ff5 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,9 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 55fe16bd4e1..b88ec09a397 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1245,6 +1246,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1903,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2468,6 +2476,9 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 75628e0eb98..09b45fbb559 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -201,6 +202,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	LockBuffer(metabuf, BT_WRITE);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -376,6 +381,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -1068,6 +1077,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1195,6 +1208,9 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1812,6 +1828,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2168,6 +2188,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..003b5e80f21 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,9 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +462,9 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1116,9 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1527,9 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1616,9 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1804,9 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..39bace9e490 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -323,6 +324,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -447,6 +452,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -505,6 +514,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ce84dac0c40..2b7b2ccad31 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,9 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2942,7 +2946,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0ec..0fa01f241f3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1112,6 +1113,9 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	/* Recording transaction prepares, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2204,6 +2208,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2294,6 +2301,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e3..365de44321d 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -73,6 +74,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* Cannot assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextFullXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index a8cda2fafbc..896f0917cef 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -16,6 +16,16 @@
 #include "postmaster/bgwriter.h"
 #include "storage/procsignal.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for the WAL writes.  For this, either of CheckWALPermitted,
+ * AssertWALPermitted_HaveXID, or AssertWALPermitted must be called before
+ * starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * ProcessBarrierWALProhibit()
  *
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bf699f19fe8..28cccfa3de5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1292,6 +1293,9 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		/* We'll be reaching here with valid XID only. */
+		AssertWALPermitted_HaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1652,6 +1656,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermitted_HaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9eb4bf413eb..5e8512bdfca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1024,7 +1024,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2859,9 +2859,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8832,6 +8834,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8861,6 +8865,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9089,6 +9095,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9246,6 +9254,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9879,7 +9889,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9893,10 +9903,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9918,8 +9928,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Make sure the WAL permission check has been done */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09eb..d69f6ca427a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -124,9 +125,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, the WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -204,6 +210,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..f961178b358 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bbe62b73a08..7e3882a055c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -934,6 +934,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 29c920800a6..ba74ddcd249 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3603,13 +3603,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 95a21f6cc38..5faa69fabb9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,20 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +314,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79bd..212312d5ae5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..ec48073bbf1 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,9 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 163fe0d2fce..1adcfc571d6 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -19,8 +19,8 @@ extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
@@ -29,4 +29,49 @@ extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
  */
 #define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermitted_HaveXID(void)
+{
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, a transaction that doesn't have a valid
+ * XID won't be killed while changing the system state to WAL prohibited.
+ * Therefore, we need to error out explicitly before entering the critical
+ * section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 18bc8a7b904..63459305383 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

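To make the long tail of per-AM changes above easier to review, here is a
minimal sketch (not part of the patch) of the coding rule they all follow:
call one of the walprohibit.h helpers before entering a WAL-writing critical
section.  The function my_update_page and its page manipulation are made up
for illustration; the helpers and XLOG calls are the real ones.

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Hypothetical AM routine, shown only to illustrate the coding rule. */
static void
my_update_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Reachable from VACUUM, so an XID may not be assigned: raise an ERROR
	 * here rather than risk a PANIC inside the critical section.  Paths that
	 * always run with an XID assigned use AssertWALPermitted_HaveXID()
	 * instead, since XID-bearing transactions are killed off when WAL
	 * becomes prohibited.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... scribble on the page ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		XLogBeginInsert();
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		/* ... register rmgr-specific data and call XLogInsert() ... */
	}

	END_CRIT_SECTION();
}
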
Attachment: v2-0002-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From bc61aef9b1749d503ab9c020f83d48c357f80122 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v2 2/6] Add alter system read only/write syntax

Note that the syntax doesn't have any implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 8 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d08..19aa6a2f88b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(WALProhibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5406,6 +5415,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 627b026b195..01cedb38115 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(WALProhibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3458,6 +3464,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e669d75a5af..f97bd6f658e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -480,6 +480,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10173,8 +10174,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->WALProhibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 97cbaa3072b..900088a2209 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2772,6 +2779,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3636,3 +3644,15 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index eb018854a5c..d586cf74816 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1858,9 +1858,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 5e1ffafb91b..636654bb450 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3194,6 +3194,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c65a55257dd..eb48b29828e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.18.0

v2-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patch (application/x-patch)
From c31d946ff7e0531c68993d9ada89dcb54da07add Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 May 2020 06:39:43 -0400
Subject: [PATCH v2 4/6] Use checkpointer to make system READ-ONLY or
 READ-WRITE

Until the previous commit, the backend used to do this itself, but now the
backend requests the checkpointer to do it. The checkpointer, noticing that
the current state has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does
the barrier request, and then acknowledges back to the backend that requested
the state change.

Note that this commit also enables ALTER SYSTEM READ WRITE support and makes
the WAL prohibited state persistent across system restarts.
---
 src/backend/access/transam/walprohibit.c |  26 ++++-
 src/backend/access/transam/xlog.c        |  77 +++++++++++++--
 src/backend/postmaster/checkpointer.c    | 116 +++++++++++++++++++++--
 src/backend/postmaster/pgstat.c          |   3 +
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  11 +++
 src/include/access/xlog.h                |   3 +-
 src/include/catalog/pg_control.h         |   3 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 10 files changed, 222 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index df97596ddf9..a8cda2fafbc 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -30,6 +30,8 @@ ProcessBarrierWALProhibit(void)
 	 */
 	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
 	{
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
 		/*
 		 * XXX: Kill off the whole session by throwing FATAL instead of killing
 		 * transaction by throwing ERROR due to following reasons that need be
@@ -64,6 +66,8 @@ ProcessBarrierWALProhibit(void)
 void
 AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
 {
+	uint32			state;
+
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
@@ -72,10 +76,22 @@ AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
 	/* Alter WAL prohibit state not allowed during recovery */
 	PreventCommandDuringRecovery("ALTER SYSTEM");
 
-	/* Yet to add ALTER SYTEM READ WRITE support */
-	if (!stmt->WALProhibited)
-		elog(ERROR, "XXX: Yet to implement");
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we have yet to convey this WAL prohibit state to all backends,
+	 * mark it as in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return; /* server is already in the desired state */
 
-	MakeReadOnlyXLOG();
-	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	WALProhibitRequest();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ffa25b9d9b2..9eb4bf413eb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -661,10 +662,10 @@ typedef struct XLogCtlData
 	RecoveryState SharedRecoveryState;
 
 	/*
-	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * SharedWALProhibitState indicates current WAL prohibit state.
 	 * Protected by info_lck.
 	 */
-	bool		WALProhibited;
+	uint32		SharedWALProhibitState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -7710,6 +7711,15 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, update the WAL prohibit state in shared
+	 * memory; it will decide whether further WAL inserts should be allowed.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = ControlFile->wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7720,7 +7730,15 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7966,12 +7984,54 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
-void
-MakeReadOnlyXLOG(void)
+/* Atomically return the current server WAL prohibited state */
+uint32
+GetWALProhibitState(void)
 {
+	uint32		state;
+
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->WALProhibited = true;
+	state = XLogCtl->SharedWALProhibitState;
 	SpinLockRelease(&XLogCtl->info_lck);
+
+	return state;
+}
+
+/*
+ * SetWALProhibitState: Change the current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is the same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Update the new state in shared memory */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = new_state;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Update control file if it is the final state */
+	if (!(new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		bool	wal_prohibited = (new_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->wal_prohibited = wal_prohibited;
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	return true;
 }
 
 /*
@@ -7980,9 +8040,7 @@ MakeReadOnlyXLOG(void)
 bool
 IsWALProhibited(void)
 {
-	volatile XLogCtlData *xlogctl = XLogCtl;
-
-	return xlogctl->WALProhibited;
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
 }
 
 /*
@@ -8250,7 +8308,6 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
-	Assert(!IsWALProhibited());
 
 	LocalXLogInsertAllowed = 1;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e5e56d4eec..bbe62b73a08 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -127,6 +128,8 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
+	ConditionVariable readonly_cv; /* signaled when a WAL prohibit state change completes */
+
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
@@ -168,6 +171,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void performWALProhibitStateChange(uint32 wal_state);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -332,6 +336,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -342,18 +347,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
-		/*
-		 * If the server is in WAL-Prohibited state then don't do anything until
-		 * someone wakes us up. E.g. a backend might later on request us to put
-		 * the system back to read-write.
-		 */
-		if (IsWALProhibited())
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			performWALProhibitStateChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
 		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later request us to put the system back to the
+			 * read-write WAL prohibit state.
+			 */
 			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
 							 WAIT_EVENT_CHECKPOINTER_MAIN);
 			continue;
 		}
 
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -891,6 +906,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->readonly_cv);
 	}
 }
 
@@ -1121,6 +1137,94 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	return true;
 }
 
+/*
+ * WALProhibitRequest: Request the checkpointer to complete the WAL prohibit
+ * state transition.
+ */
+void
+WALProhibitRequest(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		performWALProhibitStateChange(GetWALProhibitState());
+		return;
+	}
+
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
+	for (;;)
+	{
+		/*  We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+		ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
+							   WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+	elog(DEBUG1, "Done WALProhibitRequest");
+}
+
+/*
+ * performWALProhibitStateChange: checkpointer will call this to complete
+ * the requested WAL prohibit state transition.
+ */
+static void
+performWALProhibitStateChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/* Request checkpoint */
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&CheckpointerShmem->readonly_cv);
+}
+
 /*
  * CompactCheckpointerRequestQueue
  *		Remove duplicates from the request queue to avoid backend fsyncs.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc09..2b8b65f5628 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4057,6 +4057,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df744..9594df76946 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 619c33cd780..163fe0d2fce 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -18,4 +18,15 @@
 extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * This bit is used in the transition from one state to another.  When this
+ * bit is set, the state indicated by the 0th position bit has not yet been
+ * confirmed.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6e98e27b2c8..6555242e3ee 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -324,7 +324,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
-extern void MakeReadOnlyXLOG(void);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(uint32 new_state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e5382..b32c7723275 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are currently prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 13872013823..780c59f3e48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -955,6 +955,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..e8271b49f6d 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -35,6 +35,8 @@ extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
 
+extern void WALProhibitRequest(void);
+
 extern void AbsorbSyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
-- 
2.18.0

#27tushar
tushar.ahuja@enterprisedb.com
In reply to: Amul Sul (#26)
Re: [Patch] ALTER SYSTEM READ ONLY

On 6/22/20 11:59 AM, Amul Sul wrote:

2. Now skipping the startup checkpoint if the system is in read-only mode, as
discussed [2].

I am not able to run pg_checksums after shutting down my server
in read-only mode.

Steps -

1.initdb (./initdb -k -D data)
2.start the server(./pg_ctl -D data start)
3.connect to psql (./psql postgres)
4.Fire query (alter system read only;)
5.shutdown the server(./pg_ctl -D data stop)
6.pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

Result - (when server is not in read only)

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned:  916
Blocks scanned: 2976
Bad checksums:  0
Data checksum version: 1

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

#28Amul Sul
sulamul@gmail.com
In reply to: tushar (#27)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 24, 2020 at 1:54 PM tushar <tushar.ahuja@enterprisedb.com> wrote:

On 6/22/20 11:59 AM, Amul Sul wrote:

2. Now skipping the startup checkpoint if the system is in read-only mode, as
discussed [2].

I am not able to run pg_checksums after shutting down my server
in read-only mode.

Steps -

1.initdb (./initdb -k -D data)
2.start the server(./pg_ctl -D data start)
3.connect to psql (./psql postgres)
4.Fire query (alter system read only;)
5.shutdown the server(./pg_ctl -D data stop)
6.pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

Result - (when server is not in read only)

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned: 916
Blocks scanned: 2976
Bad checksums: 0
Data checksum version: 1

I think that's expected since the server wasn't cleanly shut down; a similar
error can be seen with any server that has been shut down in immediate mode
(pg_ctl -D data_dir -m i).
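
For context, pg_checksums gates on the cluster state recorded in pg_control.
A minimal sketch of that gate, paraphrased from memory rather than quoted from
src/bin/pg_checksums/pg_checksums.c, looks roughly like this, where
"ControlFile" stands for the parsed contents of global/pg_control:

	/* Refuse to run unless the control file records a clean shutdown. */
	if (ControlFile->state != DB_SHUTDOWNED &&
		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
	{
		pg_log_error("cluster must be shut down");
		exit(1);
	}

Because the patch skips the shutdown checkpoint while the system is read only,
the control file is never moved into one of those two states, so the check
above trips.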

Regards,
Amul

#29Michael Banck
michael.banck@credativ.de
In reply to: tushar (#27)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On Wed, Jun 24, 2020 at 01:54:29PM +0530, tushar wrote:

On 6/22/20 11:59 AM, Amul Sul wrote:

2. Now skipping the startup checkpoint if the system is in read-only mode, as
discussed [2].

I am not able to run pg_checksums after shutting down my server in
read-only mode.

Steps -

1.initdb (./initdb -k -D data)
2.start the server(./pg_ctl -D data start)
3.connect to psql (./psql postgres)
4.Fire query (alter system read only;)
5.shutdown the server(./pg_ctl -D data stop)
6.pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

What's the 'Database cluster state' from pg_controldata at this point?

Michael

--
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz

#30Michael Paquier
michael@paquier.xyz
In reply to: Amul Sul (#28)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jun 26, 2020 at 10:11:41AM +0530, Amul Sul wrote:

I think that's expected since the server wasn't cleanly shut down; a similar
error can be seen with any server that has been shut down in immediate mode
(pg_ctl -D data_dir -m i).

Any operation working on on-disk relation blocks needs to have a
consistent state, and a clean shutdown gives this guarantee thanks to
the shutdown checkpoint (see also pg_rewind). There are two states in
the control file, shutdown for a primary and shutdown while in
recovery to cover that. So if you stop the server cleanly but fail to
see a proper state with pg_checksums, it seems to me that the proposed
patch does not correctly handle the state of the cluster in the
control file at shutdown. That's not good.
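
For reference, the cluster states tracked in pg_control come from the DBState
enum in src/include/catalog/pg_control.h; listed here from memory (so treat
the exact members as approximate):

	typedef enum DBState
	{
		DB_STARTUP = 0,
		DB_SHUTDOWNED,				/* clean shutdown of a primary */
		DB_SHUTDOWNED_IN_RECOVERY,	/* clean shutdown while in recovery */
		DB_SHUTDOWNING,
		DB_IN_CRASH_RECOVERY,
		DB_IN_ARCHIVE_RECOVERY,
		DB_IN_PRODUCTION
	} DBState;

Tools such as pg_checksums (and pg_rewind's sanity check on its target) accept
only the two DB_SHUTDOWNED* values; a cluster left at DB_IN_PRODUCTION is
treated as not cleanly shut down.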
--
Michael

#31Amul Sul
sulamul@gmail.com
In reply to: Michael Banck (#29)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jun 26, 2020 at 12:15 PM Michael Banck
<michael.banck@credativ.de> wrote:

Hi,

On Wed, Jun 24, 2020 at 01:54:29PM +0530, tushar wrote:

On 6/22/20 11:59 AM, Amul Sul wrote:

2. Now skipping the startup checkpoint if the system is in read-only mode, as
discussed [2].

I am not able to run pg_checksums after shutting down my server in
read-only mode.

Steps -

1.initdb (./initdb -k -D data)
2.start the server(./pg_ctl -D data start)
3.connect to psql (./psql postgres)
4.Fire query (alter system read only;)
5.shutdown the server(./pg_ctl -D data stop)
6.pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

What's the 'Database cluster state' from pg_controldata at this point?

"in production"

Regards,
Amul

#32Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#30)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jun 26, 2020 at 5:59 AM Michael Paquier <michael@paquier.xyz> wrote:

Any operation working on on-disk relation blocks needs to have a
consistent state, and a clean shutdown gives this guarantee thanks to
the shutdown checkpoint (see also pg_rewind). There are two states in
the control file, shutdown for a primary and shutdown while in
recovery to cover that. So if you stop the server cleanly but fail to
see a proper state with pg_checksums, it seems to me that the proposed
patch does not handle correctly the state of the cluster in the
control file at shutdown. That's not good.

I think it is actually very good. If a feature that supposedly
prevents writing WAL permitted a shutdown checkpoint to be written, it
would be failing to accomplish its design goal. There is not much of a
use case for a feature that stops WAL from being written except when
it doesn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#32)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is a rebased version for the latest master head[1].

Regards,
Amul

[1] Commit # 101f903e51f52bf595cd8177d2e0bc6fe9000762

Attachments:

v3-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/octet-stream)
From 94b2f05c121ea4bb3198dbfcc34bd521f5902acc Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v3 1/6] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ece..13648887187 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0

v3-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patch (application/octet-stream)
From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 May 2020 06:39:43 -0400
Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
 READ-WRITE

Until the previous commit, the backend used to do this itself, but now the
backend requests the checkpointer to do it. The checkpointer, noticing that
the current state has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does
the barrier request, and then acknowledges back to the backend that requested
the state change.

Note that this commit also enables ALTER SYSTEM READ WRITE support and makes
the WAL prohibited state persistent across system restarts.
---
 src/backend/access/transam/walprohibit.c |  26 ++++-
 src/backend/access/transam/xlog.c        |  77 +++++++++++++--
 src/backend/postmaster/checkpointer.c    | 116 +++++++++++++++++++++--
 src/backend/postmaster/pgstat.c          |   3 +
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  11 +++
 src/include/access/xlog.h                |   3 +-
 src/include/catalog/pg_control.h         |   3 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 10 files changed, 222 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index df97596ddf9..a8cda2fafbc 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -30,6 +30,8 @@ ProcessBarrierWALProhibit(void)
 	 */
 	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
 	{
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
 		/*
 		 * XXX: Kill off the whole session by throwing FATAL instead of killing
 		 * transaction by throwing ERROR due to following reasons that need be
@@ -64,6 +66,8 @@ ProcessBarrierWALProhibit(void)
 void
 AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
 {
+	uint32			state;
+
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
@@ -72,10 +76,22 @@ AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
 	/* Alter WAL prohibit state not allowed during recovery */
 	PreventCommandDuringRecovery("ALTER SYSTEM");
 
-	/* Yet to add ALTER SYTEM READ WRITE support */
-	if (!stmt->WALProhibited)
-		elog(ERROR, "XXX: Yet to implement");
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we have yet to convey this WAL prohibit state to all backends,
+	 * mark it as in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return; /* server is already in the desired state */
 
-	MakeReadOnlyXLOG();
-	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	WALProhibitRequest();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c2d6d19716c..64cc347caa2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -662,10 +663,10 @@ typedef struct XLogCtlData
 	RecoveryState SharedRecoveryState;
 
 	/*
-	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * SharedWALProhibitState indicates current WAL prohibit state.
 	 * Protected by info_lck.
 	 */
-	bool		WALProhibited;
+	uint32		SharedWALProhibitState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -7713,6 +7714,15 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, update the WAL prohibit state in shared
+	 * memory; it will decide whether further WAL inserts should be allowed.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = ControlFile->wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7723,7 +7733,15 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7969,12 +7987,54 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
-void
-MakeReadOnlyXLOG(void)
+/* Atomically return the current server WAL prohibited state */
+uint32
+GetWALProhibitState(void)
 {
+	uint32		state;
+
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->WALProhibited = true;
+	state = XLogCtl->SharedWALProhibitState;
 	SpinLockRelease(&XLogCtl->info_lck);
+
+	return state;
+}
+
+/*
+ * SetWALProhibitState: Change the current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is the same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Update the new state in shared memory */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = new_state;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Update control file if it is the final state */
+	if (!(new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		bool	wal_prohibited = (new_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->wal_prohibited = wal_prohibited;
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	return true;
 }
 
 /*
@@ -7983,9 +8043,7 @@ MakeReadOnlyXLOG(void)
 bool
 IsWALProhibited(void)
 {
-	volatile XLogCtlData *xlogctl = XLogCtl;
-
-	return xlogctl->WALProhibited;
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
 }
 
 /*
@@ -8253,7 +8311,6 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
-	Assert(!IsWALProhibited());
 
 	LocalXLogInsertAllowed = 1;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e5e56d4eec..bbe62b73a08 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -127,6 +128,8 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
+	ConditionVariable readonly_cv; /* signaled when a WAL prohibit state change completes */
+
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
@@ -168,6 +171,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void performWALProhibitStateChange(uint32 wal_state);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -332,6 +336,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -342,18 +347,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
-		/*
-		 * If the server is in WAL-Prohibited state then don't do anything until
-		 * someone wakes us up. E.g. a backend might later on request us to put
-		 * the system back to read-write.
-		 */
-		if (IsWALProhibited())
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			performWALProhibitStateChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
 		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later request us to put the system back to the
+			 * read-write WAL prohibit state.
+			 */
 			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
 							 WAIT_EVENT_CHECKPOINTER_MAIN);
 			continue;
 		}
 
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -891,6 +906,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->readonly_cv);
 	}
 }
 
@@ -1121,6 +1137,94 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	return true;
 }
 
+/*
+ * WALProhibitRequest: Request the checkpointer to complete the WAL prohibit
+ * state transition.
+ */
+void
+WALProhibitRequest(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		performWALProhibitStateChange(GetWALProhibitState());
+		return;
+	}
+
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
+	for (;;)
+	{
+		/*  We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+		ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
+							   WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+	elog(DEBUG1, "Done WALProhibitRequest");
+}
+
+/*
+ * performWALProhibitStateChange: checkpointer will call this to complete
+ * the requested WAL prohibit state transition.
+ */
+static void
+performWALProhibitStateChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/* Request checkpoint */
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&CheckpointerShmem->readonly_cv);
+}
+
 /*
  * CompactCheckpointerRequestQueue
  *		Remove duplicates from the request queue to avoid backend fsyncs.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2da2c..0eb40a86b52 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4057,6 +4057,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df744..9594df76946 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 619c33cd780..163fe0d2fce 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -18,4 +18,15 @@
 extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * This bit is used in the transition from one state to another.  When this
+ * bit is set, the state indicated by the 0th position bit has not yet been
+ * confirmed.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 3578da2f420..3f9e96cd18e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -326,7 +326,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
-extern void MakeReadOnlyXLOG(void);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(uint32 new_state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e5382..b32c7723275 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are currently prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 13872013823..780c59f3e48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -955,6 +955,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..e8271b49f6d 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -35,6 +35,8 @@ extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
 
+extern void WALProhibitRequest(void);
+
 extern void AbsorbSyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
-- 
2.22.0

v3-0002-Add-alter-system-read-only-write-syntax.patch (application/octet-stream)
From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v3 2/6] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 8 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de664..ba3393b8ccf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(WALProhibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5406,6 +5415,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e3f33c40be5..b09bff458af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(WALProhibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3458,6 +3464,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dbb47d49829..6090d18ec61 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10172,8 +10173,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->WALProhibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9b0c376c8cb..7af96c77082 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2772,6 +2779,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3636,3 +3644,15 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index eb018854a5c..d586cf74816 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1858,9 +1858,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 151bcdb7ef5..f2c1ae8e3fe 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3194,6 +3194,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7eaaad1e140..247bdf1bacc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0

v3-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch (application/octet-stream)
From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() will emit
    the PROCSIGNAL_BARRIER_WALPROHIBIT barrier and will wait until the barrier
    has been absorbed by all the backends.

 2. When a backend receives the WAL-prohibited barrier, if at that moment it
    is already in a transaction and the transaction has already been assigned
    an XID, then the backend will be killed by throwing FATAL (XXX: needs more
    discussion).

 3. Otherwise, if the backend is running a transaction which has not yet been
    assigned an XID, we don't need to do anything special; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (from an existing or a new backend) starts as a
    read-only transaction.

 5. Auxiliary processes like the autovacuum launcher, background writer,
    checkpointer and walwriter don't do anything in the WAL-prohibited
    server state until someone wakes them up, e.g. a backend might later
    request to put the system back to read-write.

 6. At shutdown in WAL-prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation. Starting up again will perform crash recovery (XXX:
    needs some discussion on this as well).

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. Only a superuser can toggle the WAL-prohibit state.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |  1 +
 src/backend/access/transam/walprohibit.c | 81 ++++++++++++++++++++++++
 src/backend/access/transam/xact.c        | 49 ++++++++------
 src/backend/access/transam/xlog.c        | 72 ++++++++++++++++++---
 src/backend/postmaster/autovacuum.c      |  4 ++
 src/backend/postmaster/bgwriter.c        |  2 +-
 src/backend/postmaster/checkpointer.c    | 12 ++++
 src/backend/storage/ipc/procsignal.c     | 26 ++------
 src/backend/tcop/utility.c               | 14 +---
 src/backend/utils/misc/guc.c             | 26 ++++++++
 src/include/access/walprohibit.h         | 21 ++++++
 src/include/access/xlog.h                |  3 +
 src/include/storage/procsignal.h         |  7 +-
 13 files changed, 246 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..df97596ddf9
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "postmaster/bgwriter.h"
+#include "storage/procsignal.h"
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * only the transaction by throwing ERROR, for the following reasons
+		 * that still need thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply kill
+		 * off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then ERROR would kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Cannot continue a transaction if it has performed writes while system is read only.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Yet to add ALTER SYSTEM READ WRITE support */
+	if (!stmt->WALProhibited)
+		elog(ERROR, "XXX: Yet to implement");
+
+	MakeReadOnlyXLOG();
+	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa7ea0..4fbfcdbb965 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1937,23 +1937,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
@@ -4875,9 +4880,11 @@ CommitSubTransaction(void)
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
 	 * upper is read-write while the child is read-only; GUC will incorrectly
-	 * think it should leave the child state in place.
+	 * think it should leave the child state in place.  Note that the upper
+	 * transaction will be forced to read-only irrespective of its previous
+	 * status if the server state is WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
@@ -5033,9 +5040,11 @@ AbortSubTransaction(void)
 	/*
 	 * Restore the upper transaction's read-only state, too.  This should be
 	 * redundant with GUC's cleanup but we may as well do it for consistency
-	 * with the commit case.
+	 * with the commit case.  Note that the upper transaction will be forced
+	 * to read-only irrespective of its previous status if the server state is
+	 * WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	RESUME_INTERRUPTS();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a97b1d37fb..c2d6d19716c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -245,9 +245,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -660,6 +661,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * Protected by info_lck.
+	 */
+	bool		WALProhibited;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -7962,6 +7969,25 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+void
+MakeReadOnlyXLOG(void)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->WALProhibited = true;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->WALProhibited;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8177,9 +8203,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8193,14 +8219,25 @@ XLogInsertAllowed(void)
 		return (bool) LocalXLogInsertAllowed;
 
 	/*
-	 * Else, must check to see if we're still in recovery.
+	 * Else, must check to see if we're still in recovery
 	 */
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8216,12 +8253,20 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
+	Assert(!IsWALProhibited());
+
 	LocalXLogInsertAllowed = 1;
 
 	/* Initialize as RecoveryInProgress() would do when switching state */
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8513,7 +8558,10 @@ ShutdownXLOG(int code, Datum arg)
 
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	/*
+	 * Can't perform checkpoint or xlog rotation without writing WAL.
+	 */
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8526,6 +8574,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..f83f86994db 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read-only, just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427f..6c6ff7dc3af 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -268,7 +268,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b80..5e5e56d4eec 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -342,6 +342,18 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		/*
+		 * If the server is in the WAL-prohibited state, don't do anything until
+		 * someone wakes us up. E.g. a backend might later on request us to put
+		 * the system back to read-write.
+		 */
+		if (IsWALProhibited())
+		{
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 13648887187..b973727a580 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 7af96c77082..d6411e4f3e9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3644,15 +3644,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	/* some code */
-	elog(INFO, "AlterSystemSetWALProhibitState() called");
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f0..3384523fe48 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -221,6 +221,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -610,6 +611,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2041,6 +2043,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -11998,4 +12012,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..619c33cd780
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,21 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+
+#endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b143348879..3578da2f420 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -300,11 +300,13 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool IsWALProhibited(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -324,6 +326,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void MakeReadOnlyXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
-- 
2.22.0

Attachment: v3-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/octet-stream)
From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:10:55 -0400
Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
 write

The Assert or the ERROR is added based on the following criteria, for the
case where the system is WAL-prohibited:

 - Added an ERROR check for functions that can be reached without a valid
   XID, e.g. from VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the
   common static inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies that an XID has been assigned.  For that, added
   AssertWALPermitted_HaveXID().

To enforce the rule that one of the aforesaid assert or error checks appears
before entering a critical section that writes WAL, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
will fail an assertion when it runs inside a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() itself to perform the
check and report an error.
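
In practice the rule looks roughly like the sketch below.  This is an
illustration only, not code taken from the patch: example_wal_write() is a
hypothetical caller, while CheckWALPermitted(), AssertWALPermitted_HaveXID(),
and the critical-section macros are the names actually used in the hunks
that follow.

    #include "postgres.h"

    #include "access/walprohibit.h"
    #include "miscadmin.h"

    /* Hypothetical WAL-writing code path following the new rule. */
    static void
    example_wal_write(void)
    {
        /*
         * Reachable from VACUUM as well, so an XID is not guaranteed here:
         * error out before the critical section if WAL is prohibited.
         */
        CheckWALPermitted();

        START_CRIT_SECTION();
        /* ... modify buffers, XLogBeginInsert(), XLogInsert(), ... */
        END_CRIT_SECTION();
    }

Most hunks additionally guard the call with RelationNeedsWAL(), so unlogged
and temporary relations skip the check.  Call sites that can only be reached
with an assigned XID (ordinary INSERT and UPDATE paths) use
AssertWALPermitted_HaveXID() instead, since ProcessBarrierWALProhibit() has
already terminated any session that still held an XID; reaching such a point
while WAL is prohibited would be a bug rather than a user-facing error.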
---
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 16 ++++++++
 src/backend/access/brin/brin_revmap.c     |  8 ++++
 src/backend/access/gin/ginbtree.c         | 17 ++++++--
 src/backend/access/gin/gindatapage.c      | 14 ++++++-
 src/backend/access/gin/ginfast.c          |  8 ++++
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          | 10 ++++-
 src/backend/access/gin/ginvacuum.c        |  9 +++++
 src/backend/access/gist/gist.c            | 16 ++++++++
 src/backend/access/gist/gistvacuum.c      |  9 +++++
 src/backend/access/hash/hash.c            | 13 ++++++
 src/backend/access/hash/hashinsert.c      |  8 ++++
 src/backend/access/hash/hashovfl.c        | 14 +++++++
 src/backend/access/hash/hashpage.c        | 13 ++++++
 src/backend/access/heap/heapam.c          | 32 +++++++++++++++
 src/backend/access/heap/pruneheap.c       |  9 ++++-
 src/backend/access/heap/vacuumlazy.c      | 13 ++++++
 src/backend/access/heap/visibilitymap.c   | 20 ++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  4 ++
 src/backend/access/nbtree/nbtinsert.c     | 13 +++++-
 src/backend/access/nbtree/nbtpage.c       | 24 +++++++++++
 src/backend/access/spgist/spgdoinsert.c   | 19 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 13 ++++++
 src/backend/access/transam/multixact.c    |  6 ++-
 src/backend/access/transam/twophase.c     | 10 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  7 ++++
 src/backend/access/transam/xlog.c         | 27 +++++++++----
 src/backend/access/transam/xloginsert.c   | 13 +++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 +++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 11 ++++-
 src/backend/storage/lmgr/lock.c           |  6 +--
 src/backend/utils/cache/relmapper.c       |  4 ++
 src/include/access/walprohibit.h          | 49 ++++++++++++++++++++++-
 src/include/miscadmin.h                   | 27 +++++++++++++
 40 files changed, 491 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5ee0c..ef002a51773 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -758,6 +759,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..197e1213137 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
@@ -240,6 +245,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(idxrel))
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -881,6 +894,9 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index e8b8308f82e..b2d286404a2 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -397,6 +398,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(idxrel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -606,6 +611,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8d08b05f515..1b835b3000b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -333,6 +334,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -378,6 +380,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -386,10 +389,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -410,7 +417,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -548,6 +555,10 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -588,7 +599,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..226cb3ce44b 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -836,7 +837,11 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 		}
 
 		if (RelationNeedsWAL(indexrel))
+		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -1777,6 +1782,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1831,18 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..d7781de7674 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,9 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -587,7 +591,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * critical section.
 		 */
 		if (RelationNeedsWAL(index))
+		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..d957aa6e582 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1fedbc..938089238da 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,19 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..36a884af597 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -159,6 +160,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(gvs->index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -650,6 +655,10 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 79fe6eb8d62..8f6b15d8ee4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -134,6 +135,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -467,6 +471,10 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (!is_build && RelationNeedsWAL(rel))
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -525,6 +533,10 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -1665,6 +1677,10 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 
 	if (ndeletable > 0)
 	{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..ccf9bc0c214 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -341,6 +342,10 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(rel))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -634,6 +639,10 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(info->index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3ec6d528e77..1d3f4c92f19 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -572,6 +573,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -787,6 +792,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(rel))
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -882,6 +891,10 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..360e30456fe 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,9 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -370,6 +374,10 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..5abba14899e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,9 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -577,6 +581,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	if (RelationNeedsWAL(rel))
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -929,7 +937,13 @@ readpage:
 					 * WAL for that.
 					 */
 					if (RelationNeedsWAL(rel))
+					{
+						/*
+						 * Can reach here from VACUUM, so need not have an XID
+						 */
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..faad58297d2 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,9 @@ restart_expand:
 		goto fail;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1176,9 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+					AssertWALPermitted_HaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1230,9 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+			AssertWALPermitted_HaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1279,9 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4cd46a..f8cafc378ff 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -46,6 +46,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1870,6 +1871,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2143,6 +2147,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2661,6 +2668,9 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3413,6 +3423,9 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3586,6 +3599,9 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4519,6 +4535,9 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5310,6 +5329,9 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5468,6 +5490,9 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5576,6 +5601,9 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5722,6 +5750,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(relation))
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 256df4de105..90f43cbcb9b 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -77,11 +78,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	TransactionId OldestXmin;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -225,6 +226,10 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 									 &prstate);
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(relation))
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f75..2e5202d8068 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -1203,6 +1204,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				/* Can reach here from VACUUM, so need not have an XID */
+				if (RelationNeedsWAL(onerel))
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1471,6 +1476,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			/* Can reach here from VACUUM, so need not have an XID */
+			if (RelationNeedsWAL(onerel))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1917,6 +1926,10 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(onerel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40d..30d1d6f34c7 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -270,6 +271,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (RelationNeedsWAL(rel))
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -476,6 +487,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +501,14 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..a471a4b7ff5 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,9 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b86c122763e..3527c9f4183 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1900,13 +1904,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2477,9 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 75628e0eb98..09b45fbb559 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -201,6 +202,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	LockBuffer(metabuf, BT_WRITE);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -376,6 +381,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(rel))
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -1068,6 +1077,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1195,6 +1208,9 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1812,6 +1828,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2168,6 +2188,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(rel))
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..003b5e80f21 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,9 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +462,9 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1116,9 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1527,9 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1616,9 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1804,9 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..39bace9e490 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -323,6 +324,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -447,6 +452,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -505,6 +514,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	if (RelationNeedsWAL(index))
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ce84dac0c40..2b7b2ccad31 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,9 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	/* Can reach here from VACUUM, so need not have an XID */
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2942,7 +2946,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0ec..0fa01f241f3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1112,6 +1113,9 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	/* Recording transaction prepares, so we'll have an XID */
+	AssertWALPermitted_HaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2204,6 +2208,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2294,6 +2301,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e3..365de44321d 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -73,6 +74,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* Cannot assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextFullXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index a8cda2fafbc..896f0917cef 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -16,6 +16,16 @@
 #include "postmaster/bgwriter.h"
 #include "storage/procsignal.h"
 
+/*
+ * Assert-only flag enforcing the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted, AssertWALPermitted_HaveXID, or AssertWALPermitted must
+ * be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * ProcessBarrierWALProhibit()
  *
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4fbfcdbb965..02830cbf85d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1292,6 +1293,9 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		/* We can only reach here with a valid XID. */
+		AssertWALPermitted_HaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1652,6 +1656,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermitted_HaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 64cc347caa2..d5d184b3a50 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1027,7 +1027,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2862,9 +2862,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not restrict flushing; otherwise dirty buffers would
+	 * become unevictable, since eviction requires flushing WAL up to the LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8836,6 +8838,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8865,6 +8869,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9093,6 +9099,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9250,6 +9258,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9890,7 +9900,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9904,10 +9914,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9929,8 +9939,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09eb..d69f6ca427a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -124,9 +125,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error here would force a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -204,6 +210,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..f961178b358 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermitted_HaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bbe62b73a08..7e3882a055c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -934,6 +934,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 29c920800a6..ba74ddcd249 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3603,13 +3603,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 95a21f6cc38..5faa69fabb9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,20 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +314,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79bd..212312d5ae5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..ec48073bbf1 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,9 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		/* Must be performing an INSERT or UPDATE, so we'll have an XID */
+		AssertWALPermitted_HaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 163fe0d2fce..1adcfc571d6 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -19,8 +19,8 @@ extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
@@ -29,4 +29,49 @@ extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
  */
 #define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * The startup process in recovery is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermitted_HaveXID(void)
+{
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertions above, a transaction that doesn't have a
+ * valid XID won't be killed while changing the system state to WAL prohibited.
+ * Therefore, we need to explicitly error out before entering the critical
+ * section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 18bc8a7b904..63459305383 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section; otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0

Attachment: v3-0006-Documentation-WIP.patch (application/octet-stream)
From 75d3fcea66b1f6acee65bfd8506fef67c2152f62 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v3 6/6] Documentation - WIP

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 59 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 60 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..f62929f1660 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -433,8 +433,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -477,6 +477,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is read only when it is not currently possible to insert write-ahead
+log records, either because the system is still in recovery or because it has
+been forced into the WAL prohibited state by ALTER SYSTEM READ ONLY.  We have a
+lower-level defense in XLogBeginInsert() and elsewhere that stops us from
+modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is called
+inside a critical section we must not depend on it to report an error, since
+any error there is promoted to a PANIC, as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend which receives the read-only state transition barrier interrupt
+must stop writing WAL immediately.  While absorbing the barrier, a backend
+kills its running transaction if that transaction has a valid XID, because a
+valid XID indicates that the transaction has performed, or is planning, a WAL
+write.  Transactions that have not acquired an XID, and operations such as
+VACUUM or CREATE INDEX CONCURRENTLY which do not necessarily have an XID when
+writing WAL, are not stopped during barrier processing; they may instead hit
+an error from XLogBeginInsert() when they try to write WAL in the read only
+state.  To prevent such an error from being raised inside a critical section,
+WAL write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, an assert-only flag records whether permission has
+been checked before XLogBeginInsert() is called; if it has not been,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing
+an error is acceptable.  To set the flag, call CheckWALPermitted(),
+AssertWALPermitted_HaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION().  The flag is reset automatically when the critical
+section is exited.  The rules for choosing which permission check routine to
+place are:
+
+	Places where a WAL write inside a critical section can be reached without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before the critical section is entered.
+
+	Places that handle INSERT and UPDATE, which never happen without a valid
+	XID, can use AssertWALPermitted_HaveXID(), so that non-assert builds do
+	not pay the cost of the check.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but that should still verify on assert-enabled
+	builds that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -522,7 +570,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -630,8 +679,8 @@ If the buffer is clean and checksums are in use then
 MarkBufferDirtyHint() inserts an XLOG_FPI record to ensure that we
 take a full page image that includes the hint. We do this to avoid
 a partial page write, when we write the dirtied page. WAL is not
-written during recovery, so we simply skip dirtying blocks because
-of hints when in recovery.
+written while in read only (i.e. during recovery or in WAL prohibit state), so
+we simply skip dirtying blocks because of hints when in read only.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0
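
To illustrate the coding rule described in the README above, a WAL-writing
call site would look roughly like the following. This is only a minimal
sketch: the routine name is made up and the logged record is illustrative;
only the permission-check function comes from the patch.

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Hypothetical call site following the new rule (not part of the patch). */
static void
example_log_page_change(Relation rel, Buffer buf)
{
	/*
	 * Reachable without an assigned XID (e.g. from VACUUM), so use the
	 * erroring check rather than a bare assertion; the error must be raised
	 * before the critical section, where it would be promoted to PANIC.
	 */
	if (RelationNeedsWAL(rel))
		CheckWALPermitted();

	START_CRIT_SECTION();

	MarkBufferDirty(buf);

	if (RelationNeedsWAL(rel))
	{
		/*
		 * log_newpage_buffer() calls XLogBeginInsert(), which (with asserts
		 * enabled) verifies that the permission check above was done, and it
		 * sets the page LSN itself.
		 */
		(void) log_newpage_buffer(buf, true);
	}

	END_CRIT_SECTION();
}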

#34Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Amul Sul (#33)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi All,
I was testing the feature on top of the v3 patch and found a "pg_upgrade"
failure after running "alter system read only;", as below:

-- Steps:
./initdb -D data
./pg_ctl -D data -l logs start -c
./psql postgres
alter system read only;
\q
./pg_ctl -D data -l logs stop -c

./initdb -D data2
./pg_upgrade -b . -B . -d data -D data2 -p 5555 -P 5520

[edb@localhost bin]$ ./pg_upgrade -b . -B . -d data -D data2 -p 5555 -P 5520
Performing Consistency Checks
-----------------------------
Checking cluster versions ok

The source cluster was not shut down cleanly.
Failure, exiting

--Below is the logs
2021-07-16 11:04:20.305 IST [105788] LOG: starting PostgreSQL 14devel on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat
4.8.5-39), 64-bit
2020-07-16 11:04:20.309 IST [105788] LOG: listening on IPv6 address "::1",
port 5432
2020-07-16 11:04:20.309 IST [105788] LOG: listening on IPv4 address
"127.0.0.1", port 5432
2020-07-16 11:04:20.321 IST [105788] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5432"
2020-07-16 11:04:20.347 IST [105789] LOG: database system was shut down at
2020-07-16 11:04:20 IST
2020-07-16 11:04:20.352 IST [105788] LOG: database system is ready to
accept connections
2020-07-16 11:04:20.534 IST [105790] LOG: system is now read only
2020-07-16 11:04:20.542 IST [105788] LOG: received fast shutdown request
2020-07-16 11:04:20.543 IST [105788] LOG: aborting any active transactions
2020-07-16 11:04:20.544 IST [105788] LOG: background worker "logical
replication launcher" (PID 105795) exited with exit code 1
2020-07-16 11:04:20.544 IST [105790] LOG: shutting down
2020-07-16 11:04:20.544 IST [105790] LOG: skipping shutdown checkpoint
because the system is read only
2020-07-16 11:04:20.551 IST [105788] LOG: database system is shut down

On Tue, Jul 14, 2020 at 12:08 PM Amul Sul <sulamul@gmail.com> wrote:

Attached is a rebased version for the latest master head[1].

Regards,
Amul

1] Commit # 101f903e51f52bf595cd8177d2e0bc6fe9000762

--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

#35Robert Haas
robertmhaas@gmail.com
In reply to: Prabhat Sahu (#34)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 16, 2020 at 2:12 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com>
wrote:

Hi All,
I was testing the feature on top of the v3 patch and found a "pg_upgrade"
failure after running "alter system read only;", as below:

That's expected. You can't perform a clean shutdown without writing WAL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#35)
Re: [Patch] ALTER SYSTEM READ ONLY

Hello,

I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.

The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may still happen after the system is put into a read-only state),
which makes it confusing to a degree.

Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc, without the need
to shut down the cluster. It would also enable an interesting use case:
we should also be able to do a live upgrade on any running cluster and
entertain read-only queries at the same time, given that all the
cluster's files will be immutable?

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

Regards,

Soumyadeep (VMware)

#37SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Soumyadeep Chakraborty (#36)
Re: [Patch] ALTER SYSTEM READ ONLY

+1 to this feature; I have been thinking about it for some time. There
are several use cases for marking a database read only (no transaction log
generation). Some examples in a hosted service scenario are 1/ when a
customer runs out of storage space, 2/ upgrading the server to a different
major version (the current server can be set to read only, a new one can be
built, and then DNS switched), 3/ when a user wants to force a database to
read only and not accept writes, maybe for importing/exporting a database.

Thanks,
Satya


#38Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: amul sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi Amul,

On Tue, Jun 16, 2020 at 6:56 AM amul sul <sulamul@gmail.com> wrote:

The proposed feature is built atop of super barrier mechanism commit[1] to
coordinate
global state changes to all active backends. Backends which executed
ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
process to change the requested WAL read/write state aka WAL prohibited and
WAL
permitted state respectively. When the checkpointer process sees the WAL
prohibit
state change request, it emits a global barrier and waits until all
backends that
participate in the ProcSignal absorbs it.

Why should the checkpointer have the responsibility of setting the state
of the system to read-only? Maybe this should be the postmaster's
responsibility - the checkpointer should just handle requests to
checkpoint. I think the backend requesting the read-only transition
should signal the postmaster, which in turn, will take on the aforesaid
responsibilities. The postmaster, could also additionally request a
checkpoint, using RequestCheckpoint() (if we want to support the
read-onlyness discussed in [1]). checkpointer.c should not be touched by
this feature.

Following on, any condition variable used by the backend to wait for the
ALTER SYSTEM command to finish (the patch uses
CheckpointerShmem->readonly_cv), could be housed in ProcGlobal.

Regards,
Soumyadeep (VMware)

[1]: /messages/by-id/CAE-ML+-zdWODAyWNs_Eu-siPxp_3PGbPkiSg=toLeW9iS_eioA@mail.gmail.com

#39Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#36)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 3:33 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

Hello,

I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.

True.

The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may still happen after the system is put into a read-only state),
which makes it confusing to a degree.

Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc, without the need
to shut down the cluster. It would also enable an interesting use case:
we should also be able to do a live upgrade on any running cluster and
entertain read-only queries at the same time, given that all the
cluster's files will be immutable?

Read-only is for the queries.

The aim of this feature is preventing new WAL records from being generated, not
preventing them from being flushed to disk, or streamed to standbys, or anything
else. The rest should happen as normal.

If you can't flush WAL, then you might not be able to evict some number of
buffers, which in the worst case could be large. That's because you can't evict
a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise,
you wouldn't be following the WAL-before-data rule). And having a potentially
large number of unevictable buffers around sounds terrible, not only for
performance, but also for having the system keep working at all.
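
For reference, that dependency is what the buffer-eviction path enforces,
roughly sketched below. This is a simplified sketch of what the buffer manager
does when writing out a dirty page, not code from this patch; the function name
is made up and pinning/locking details are omitted.

#include "postgres.h"

#include "access/xlog.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"

/* Simplified sketch of the WAL-before-data rule during buffer write-out. */
static void
sketch_flush_dirty_buffer(SMgrRelation reln, ForkNumber forknum,
						  BlockNumber blocknum, Buffer buf)
{
	Page		page = BufferGetPage(buf);
	XLogRecPtr	recptr = PageGetLSN(page);

	/*
	 * WAL up to the page's LSN must reach disk before the page itself does.
	 * If WAL flushing were also prohibited, this buffer could never be
	 * written out, and hence never evicted.
	 */
	XLogFlush(recptr);

	/* Now it is safe to write the data page. */
	smgrwrite(reln, forknum, blocknum, (char *) page, false);
}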

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

The intention is to change the system to read-only ASAP; the checkpoint will
make it much slower.

I don't think we can skip the control file updates that are needed to make
the read-only state persistent across a restart.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

Sure, thanks for the suggestions. The syntax change is not the hard part; we
can choose the better option later.

Regards,
Amul

#40Amul Sul
sulamul@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#37)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 4:34 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

+1 to this feature and I have been thinking about it for sometime. There are several use cases with marking database read only (no transaction log generation). Some of the examples in a hosted service scenario are 1/ when customer runs out of storage space, 2/ Upgrading the server to a different major version (current server can be set to read only, new one can be built and then switch DNS), 3/ If user wants to force a database to read only and not accept writes, may be for import / export a database.

Thanks for voting & listing the realistic use cases.

Regards,
Amul

#41Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#38)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 6:08 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

Hi Amul,

Thanks, Soumyadeep for looking and putting your thoughts on the patch.

On Tue, Jun 16, 2020 at 6:56 AM amul sul <sulamul@gmail.com> wrote:

The proposed feature is built atop of super barrier mechanism commit[1] to
coordinate
global state changes to all active backends. Backends which executed
ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
process to change the requested WAL read/write state aka WAL prohibited and
WAL
permitted state respectively. When the checkpointer process sees the WAL
prohibit
state change request, it emits a global barrier and waits until all
backends that
participate in the ProcSignal absorbs it.

Why should the checkpointer have the responsibility of setting the state
of the system to read-only? Maybe this should be the postmaster's
responsibility - the checkpointer should just handle requests to
checkpoint.

Well, once we've initiated the change to a read-only state, we probably want to
always either finish that change or go back to read-write, even if the process
that initiated the change is interrupted. Leaving the system in a
half-way-in-between state long term seems bad. We could have added a new
background process, but we chose to put the checkpointer in charge of making
the state change, to avoid a new background process and keep the first version
of the patch simple. The checkpointer isn't likely to get killed, but if it
does, it will be relaunched and the new one can clean things up. On the other
hand, I agree that making the checkpointer responsible for more than one thing
might not be a good idea, but I don't think the postmaster should do the work
that any background process can do.

I think the backend requesting the read-only transition
should signal the postmaster, which in turn, will take on the aforesaid
responsibilities. The postmaster, could also additionally request a
checkpoint, using RequestCheckpoint() (if we want to support the
read-onlyness discussed in [1]). checkpointer.c should not be touched by
this feature.

Following on, any condition variable used by the backend to wait for the
ALTER SYSTEM command to finish (the patch uses
CheckpointerShmem->readonly_cv), could be housed in ProcGlobal.

Relevant only if we don't want to use the checkpointer process.

Regards,
Amul

#42Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#39)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 3:42 AM Amul Sul <sulamul@gmail.com> wrote:

The aim of this feature is preventing new WAL records from being generated, not
preventing them from being flushed to disk, or streamed to standbys, or anything
else. The rest should happen as normal.

If you can't flush WAL, then you might not be able to evict some number of
buffers, which in the worst case could be large. That's because you can't evict
a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise,
you wouldn't be following the WAL-before-data rule). And having a potentially
large number of unevictable buffers around sounds terrible, not only for
performance, but also for having the system keep working at all.

With the read-only level I was suggesting, I didn't mean that we stop
WAL flushes; in fact, we should flush the WAL before we mark the system
as read-only. Once the system declares itself read-only, it will not
perform any more on-disk changes; it may perform all the flushes it
needs as part of handling the read-only request.

WAL should still stream to the secondary of course, even after you mark
the primary as read-only.

Read-only is for the queries.

What I am saying is it doesn't have to be just the queries. I think we
can cater to all the other use cases simply by forcing a checkpoint
before marking the system as read-only.

The intention is to change the system to read-only ASAP; the checkpoint will
make it much slower.

I agree - if one needs that speed, then they can do the equivalent of:
ALTER SYSTEM SET read_only to 'wal';
and the expensive checkpoint you mentioned can be avoided.

I don't think we can skip control file updates that need to make read-only
state persistent across the restart.

I was referring to control file updates after the read-only state change.
Any updates done as a part of the state change are totally fine.

Regards,
Soumyadeep (VMware)

#43Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#41)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 3:57 AM Amul Sul <sulamul@gmail.com> wrote:

Well, once we've initiated the change to a read-only state, we probably want to
always either finish that change or go back to read-write, even if the process
that initiated the change is interrupted. Leaving the system in a
half-way-in-between state long term seems bad. We could have added a new
background process, but we chose to put the checkpointer in charge of making
the state change, to avoid a new background process and keep the first version
of the patch simple. The checkpointer isn't likely to get killed, but if it
does, it will be relaunched and the new one can clean things up. On the other
hand, I agree that making the checkpointer responsible for more than one thing
might not be a good idea, but I don't think the postmaster should do the work
that any background process can do.

+1 for doing it in a background process rather than in the backend
itself (we can't risk doing it in a backend, since a backend can crash
and won't be restarted to clean things up the way a background process
would).

As my co-worker pointed out to me, doing the work in the postmaster is a
very bad idea as we don't want delays in serving connection requests on
account of the barrier that comes with this patch.

I would like to see this responsibility in a separate auxiliary process
but I guess having it in the checkpointer isn't the end of the world.

Regards,
Soumyadeep (VMware)

#44Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#21)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think we'd want the FIRST write operation to be the end-of-recovery
checkpoint, before the system is fully read-write. And then after that
completes you could do other things.

I can't see why this is necessary from a correctness or performance
point of view. Maybe I'm missing something.

In case it is necessary, the patch set does not wait for the checkpoint to
complete before marking the system as read-write. Refer:

/* Set final state by clearing in-progress flag bit */
if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
{
if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
ereport(LOG, (errmsg("system is now read only")));
else
{
/* Request checkpoint */
RequestCheckpoint(CHECKPOINT_IMMEDIATE);
ereport(LOG, (errmsg("system is now read write")));
}
}

We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
we SetWALProhibitState() and do the ereport(), if we have a read-write
state change request.

Also, we currently request this checkpoint even if there was no startup
recovery and we don't set CHECKPOINT_END_OF_RECOVERY in the case where
the read-write request does follow a startup recovery.
So it should really be:
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
CHECKPOINT_END_OF_RECOVERY);
We would need to convey that an end-of-recovery-checkpoint is pending in
shmem somehow (and only if one such checkpoint is pending, should we do
it as a part of the read-write request handling).
Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
/*
* Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
*/
and then check for that.
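
Put together, the suggested ordering would look roughly like the sketch below.
This is only a sketch of the suggestion, not tested code; the
end_of_recovery_pending flag is hypothetical, standing in for whatever shared
state records that the end-of-recovery checkpoint was skipped.

/* Sketch only: finish the checkpoint before declaring the system read-write. */
if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) == 0)
{
	int		ckpt_flags = CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT;

	/* hypothetical flag: an end-of-recovery checkpoint is still owed */
	if (end_of_recovery_pending)
		ckpt_flags |= CHECKPOINT_END_OF_RECOVERY;

	RequestCheckpoint(ckpt_flags);
}

/* Set final state by clearing in-progress flag bit */
if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
{
	if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
		ereport(LOG, (errmsg("system is now read only")));
	else
		ereport(LOG, (errmsg("system is now read write")));
}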

Some minor comments about the code (some of them probably don't
warrant immediate attention, but for the record...):

1. There are some places where we can use a local variable to store the
result of RelationNeedsWAL() to avoid repeated calls to it. E.g.
brin_doupdate()

2. Similarly, we can also capture the calls to GetWALProhibitState() in
a local variable where applicable. E.g. inside WALProhibitRequest().

3. Some of the functions that were added such as GetWALProhibitState(),
IsWALProhibited() etc could be declared static inline.

4. IsWALProhibited(): Shouldn't it really be:
bool
IsWALProhibited(void)
{
uint32 walProhibitState = GetWALProhibitState();
return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0
&& (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
}

5. I think the comments:
/* Must be performing an INSERT or UPDATE, so we'll have an XID */
and
/* Can reach here from VACUUM, so need not have an XID */
can be internalized in the function/macro comment header.

6. Typo: ConditionVariable readonly_cv; /* signaled when ckpt_started
advances */
We need to update the comment here.

Regards,
Soumyadeep (VMware)

#45Andres Freund
andres@anarazel.de
In reply to: Amul Sul (#33)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v3 2/6] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
src/backend/nodes/copyfuncs.c | 12 ++++++++++++
src/backend/nodes/equalfuncs.c | 9 +++++++++
src/backend/parser/gram.y | 13 +++++++++++++
src/backend/tcop/utility.c | 20 ++++++++++++++++++++
src/bin/psql/tab-complete.c | 6 ++++--
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++++++++++
src/tools/pgindent/typedefs.list | 1 +
8 files changed, 70 insertions(+), 2 deletions(-)

Shouldn't there be outfuncs support as well? Perhaps we even need
readfuncs, not immediately sure.

From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

1. When a user tries to change the server state to WAL-Prohibited using the
ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() will emit the
PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until the
barrier has been absorbed by all the backends.

2. When a backend receives the WAL-Prohibited barrier, if at that moment it
is already in a transaction and the transaction has already been assigned an
XID, then the backend will be killed by throwing FATAL (XXX: need more
discussion on this)

I think we should consider introducing XACTFATAL or such, guaranteeing
the transaction gets aborted, without requiring a FATAL. This has been
needed for enough cases that it's worthwhile.

There are several cases where we WAL log without having an xid
assigned. E.g. when HOT pruning during syscache lookups or such. Are
there any cases where the check for being in recovery is followed by a
CHECK_FOR_INTERRUPTS, before the WAL logging is done?

3. Otherwise, if that backend is running a transaction which has not yet been
assigned an XID, we don't need to do anything special; we simply call
ResetLocalXLogInsertAllowed() so that any future WAL insert will check
XLogInsertAllowed() first, which reflects the read-only state appropriately.

4. A new transaction (from existing or new backend) starts as a read-only
transaction.

Why do we need 4)? And doesn't that have the potential to be
unnecessarily problematic if the server is subsequently brought out of
the readonly state again?

5. Auxiliary processes like the autovacuum launcher, background writer,
checkpointer and walwriter won't do anything in the WAL-Prohibited
server state until someone wakes us up. E.g. a backend might later on
request us to put the system back to read-write.

Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
anything in this state. bgwriter for example is even running entirely
normally in a hot standby node?

6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
and xlog rotation. Starting up again will perform crash recovery (XXX:
need some discussion on this as well)

7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.

8. Only super user can toggle WAL-Prohibit state.

9. Add a system_is_read_only GUC to show the system state -- it will be true
when the system is WAL prohibited or in recovery.

+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));

ISTM we should rather do this in a GRANTable manner. We've worked
substantially towards that in the last few years.

+	/*
+	 * WALProhibited indicates if we have stopped allowing WAL writes.
+	 * Protected by info_lck.
+	 */
+	bool		WALProhibited;
+
/*
* SharedHotStandbyActive indicates if we allow hot standby queries to be
* run.  Protected by info_lck.
@@ -7962,6 +7969,25 @@ StartupXLOG(void)
RequestCheckpoint(CHECKPOINT_FORCE);
}
+void
+MakeReadOnlyXLOG(void)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->WALProhibited = true;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->WALProhibited;
+}

What is this kind of locking achieving? It doesn't protect against
concurrent ALTER SYSTEM SET READ ONLY or such?

+		/*
+		 * If the server is in WAL-Prohibited state then don't do anything until
+		 * someone wakes us up. E.g. a backend might later on request us to put
+		 * the system back to read-write.
+		 */
+		if (IsWALProhibited())
+		{
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
/*
* Detect a pending checkpoint request by checking whether the flags
* word in shared memory is nonzero.  We shouldn't need to acquire the

So if the ASRO happens while a checkpoint is in progress, potentially with
checkpoint_timeout = 60d, it'll not take effect until the checkpoint has
finished.

But uh, as far as I can tell, the code would simply continue an
in-progress checkpoint, despite having absorbed the barrier. And then
we'd PANIC when doing the XLogInsert()?

diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..619c33cd780
--- /dev/null
+++ b/src/include/access/walprohibit.h

Not sure I like the mix of xlog/wal prefix for pretty closely related
files... I'm not convinced it's worth having a separate file for this,
fwiw.

From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 May 2020 06:39:43 -0400
Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
READ-WRITE

Till the previous commit, the backend used to do this, but now the backend
requests the checkpointer to do it. The checkpointer, noticing that the current
state has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier
request, and then acknowledges back to the backend that requested the state change.

Note that this commit also enables ALTER SYSTEM READ WRITE support and makes the
WAL-prohibited state persistent across system restarts.

The split between the previous commit and this commit seems more
confusing than useful to me.

+/*
+ * WALProhibitedRequest: Request checkpointer to make the WALProhibitState to
+ * read-only.
+ */
+void
+WALProhibitRequest(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		performWALProhibitStateChange(GetWALProhibitState());
+		return;
+	}
+
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
+	for (;;)
+	{
+		/*  We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+		ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
+							   WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+	elog(DEBUG1, "Done WALProhibitRequest");
+}

Isn't it possible that the system could have been changed back to be
read-write by the time the wakeup is being processed?

From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:10:55 -0400
Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
write

Isn't that the wrong order? This needs to come before the feature is
enabled, no?

@@ -758,6 +759,9 @@ brinbuildempty(Relation index)
ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);

+	/* Building indexes will have an XID */
+	AssertWALPermitted_HaveXID();
+

Ugh, that's a pretty ugly naming scheme mix.

@@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
brin_can_do_samepage_update(oldbuf, origsz, newsz))
{
+		/* Can reach here from VACUUM, so need not have an XID */
+		if (RelationNeedsWAL(idxrel))
+			CheckWALPermitted();
+

Hm. Maybe I am confused, but why is that dependent on
RelationNeedsWAL()? Shouldn't readonly actually mean readonly, even if
no WAL is emitted?

#include "access/genam.h"
#include "access/gist_private.h"
#include "access/transam.h"
+#include "access/walprohibit.h"
#include "commands/vacuum.h"
#include "lib/integerset.h"
#include "miscadmin.h"

The number of places that now need this new header - pretty much the
same set of files that do XLogInsert, already requiring an xlog* header
to be included - drives me further towards the conclusion that it's not
a good idea to have it separate.

extern void ProcessInterrupts(void);

+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, marked checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+

Why are these in headers? And why is this tied to CritSectionCount?

Greetings,

Andres Freund

#46Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#44)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <
soumyadeep2007@gmail.com> wrote:

On Thu, Jun 18, 2020 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think we'd want the FIRST write operation to be the end-of-recovery
checkpoint, before the system is fully read-write. And then after that
completes you could do other things.

I can't see why this is necessary from a correctness or performance
point of view. Maybe I'm missing something.

In case it is necessary, the patch set does not wait for the checkpoint to
complete before marking the system as read-write. Refer:

/* Set final state by clearing in-progress flag bit */
if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
{
	if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
		ereport(LOG, (errmsg("system is now read only")));
	else
	{
		/* Request checkpoint */
		RequestCheckpoint(CHECKPOINT_IMMEDIATE);
		ereport(LOG, (errmsg("system is now read write")));
	}
}

We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
we SetWALProhibitState() and do the ereport(), if we have a read-write
state change request.

+1, I too have the same question.

FWIW, I don't think we can request CHECKPOINT_WAIT at this place; otherwise, I
think it will be a deadlock case -- the checkpointer process waiting for itself.

Also, we currently request this checkpoint even if there was no startup
recovery and we don't set CHECKPOINT_END_OF_RECOVERY in the case where
the read-write request does follow a startup recovery.
So it should really be:
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
CHECKPOINT_END_OF_RECOVERY);
We would need to convey that an end-of-recovery-checkpoint is pending in
shmem somehow (and only if one such checkpoint is pending, should we do
it as a part of the read-write request handling).
Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
	/*
	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited
	 * state.
	 */
and then check for that.

Yep, we need some indication that the end-of-recovery checkpoint was skipped at
startup, but I haven't added that since I wasn't sure whether we really need
CHECKPOINT_END_OF_RECOVERY, per the previous concern.
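
As a rough illustration of what such an indication might look like -- the
endOfRecoveryCkptPending field and the condition guarding it are assumptions of
this sketch, not something from the posted patches, while CreateCheckPoint() and
the CHECKPOINT_* flags are existing core pieces:

/* In StartupXLOG(), at the point where the end-of-recovery checkpoint is skipped */
if (wal_prohibited)								/* assumed condition */
	XLogCtl->endOfRecoveryCkptPending = true;	/* assumed field */

/* Later, while the checkpointer handles an ALTER SYSTEM READ WRITE request */
int		flags = CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE;

if (XLogCtl->endOfRecoveryCkptPending)
	flags |= CHECKPOINT_END_OF_RECOVERY;
CreateCheckPoint(flags);
XLogCtl->endOfRecoveryCkptPending = false;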

Some minor comments about the code (some of them probably don't
warrant immediate attention, but for the record...):

1. There are some places where we can use a local variable to store the
result of RelationNeedsWAL() to avoid repeated calls to it. E.g.
brin_doupdate()

Ok.

2. Similarly, we can also capture the calls to GetWALProhibitState() in
a local variable where applicable. E.g. inside WALProhibitRequest().

I don't think so.

3. Some of the functions that were added such as GetWALProhibitState(),
IsWALProhibited() etc could be declared static inline.

IsWALProhibited() can be static, but not GetWALProhibitState(), since it
needs to be accessible from other files.

4. IsWALProhibited(): Shouldn't it really be:
bool
IsWALProhibited(void)
{
	uint32		walProhibitState = GetWALProhibitState();

	return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0
		&& (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
}

I think the current one is better: it allows read-write transactions from an
existing backend which has absorbed the barrier, or from a new backend, while we
are changing the state to read-write, on the assumption that we never fall back.

5. I think the comments:
/* Must be performing an INSERT or UPDATE, so we'll have an XID */
and
/* Can reach here from VACUUM, so need not have an XID */
can be internalized in the function/macro comment header.

Ok.

6. Typo: ConditionVariable readonly_cv; /* signaled when ckpt_started
advances */
We need to update the comment here.

Ok.

Will try to address all the above review comments in the next version along
with
Andres' concern/suggestion. Thanks again for your time.

Regards,
Amul

#47Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#45)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 24, 2020 at 7:34 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Thanks for looking at the patch.

From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v3 2/6] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
src/backend/nodes/copyfuncs.c | 12 ++++++++++++
src/backend/nodes/equalfuncs.c | 9 +++++++++
src/backend/parser/gram.y | 13 +++++++++++++
src/backend/tcop/utility.c | 20 ++++++++++++++++++++
src/bin/psql/tab-complete.c | 6 ++++--
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++++++++++
src/tools/pgindent/typedefs.list | 1 +
8 files changed, 70 insertions(+), 2 deletions(-)

Shouldn't there be outfuncs support as well? Perhaps we even need
readfuncs; not immediately sure.

Ok, can add that as well.

From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

1. When a user tries to change the server state to WAL-prohibited using the
ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() will emit the
PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until the
barrier has been absorbed by all the backends.

2. When a backend receives the WAL-prohibited barrier, if at that moment it is
already in a transaction and the transaction has already been assigned an XID,
then the backend will be killed by throwing FATAL. (XXX: needs more discussion
on this)

I think we should consider introducing XACTFATAL or such, guaranteeing
the transaction gets aborted, without requiring a FATAL. This has been
needed for enough cases that it's worthwhile.

As far as I am aware, the existing code in PostgresMain() uses FATAL to
terminate the connection when protocol synchronization is lost. Currently, this
proposal and another one, "Terminate the idle sessions"[1], are using FATAL,
afaik.

There are several cases where we WAL log without having an xid
assigned. E.g. when HOT pruning during syscache lookups or such. Are
there any cases where the check for being in recovery is followed by a
CHECK_FOR_INTERRUPTS, before the WAL logging is done?

In the case of an operation without an XID, an error will be raised just before
the point where the WAL record is expected. The places you are asking about, I
haven't found at a glance; I will try to search for them, but I am sure the
current implementation is not missing the places where it is supposed to check
the prohibited state and complain.

Quick question: is it possible that pruning will happen with a SELECT query?
It would be helpful if you or someone else could point me to a place where WAL
can be generated even in the case of read-only queries.

3. Otherwise, if that backend is running a transaction which has not yet
been assigned an XID, we don't need to do anything special; we simply call
ResetLocalXLogInsertAllowed() so that any future WAL insert will check
XLogInsertAllowed() first, which sets the read-only state appropriately.

4. A new transaction (from existing or new backend) starts as a read-only
transaction.

Why do we need 4)? And doesn't that have the potential to be
unnecessarily problematic if the server is subsequently brought out of
the read-only state again?

The transaction that was started in the read-only system state will be read-only
until the end. I think that shouldn't be too problematic.

5. Auxiliary processes like the autovacuum launcher, background writer,
checkpointer and walwriter won't do anything in the WAL-prohibited
server state until someone wakes us up. E.g. a backend might later on
request us to put the system back to read-write.

Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
anything in this state. bgwriter for example is even running entirely
normally in a hot standby node?

I think I forgot to update the description when I reverted the
walwriter changes. The current version doesn't have any changes to
the walwriter, and bgwriter too behaves the same as it does on a system
in recovery. Will update this, sorry for the confusion.

6. At shutdown in WAL-prohibited mode, we'll skip the shutdown checkpoint
and xlog rotation. Starting up again will perform crash recovery. (XXX:
needs some discussion as well)

7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

8. Only a superuser can toggle the WAL-prohibit state.

9. Add a system_is_read_only GUC to show the system state -- it will be true
when the system is WAL-prohibited or in recovery.

+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+     if (!superuser())
+             ereport(ERROR,
+                             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                              errmsg("must be superuser to execute ALTER SYSTEM command")));

ISTM we should rather do this in a GRANTable manner. We've worked
substantially towards that in the last few years.

I added this to be in line with AlterSystemSetConfigFile(); if we want a
GRANTable manner, I will try that.

+     /*
+      * WALProhibited indicates if we have stopped allowing WAL writes.
+      * Protected by info_lck.
+      */
+     bool            WALProhibited;
+
/*
* SharedHotStandbyActive indicates if we allow hot standby queries to be
* run.  Protected by info_lck.
@@ -7962,6 +7969,25 @@ StartupXLOG(void)
RequestCheckpoint(CHECKPOINT_FORCE);
}
+void
+MakeReadOnlyXLOG(void)
+{
+     SpinLockAcquire(&XLogCtl->info_lck);
+     XLogCtl->WALProhibited = true;
+     SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+     volatile XLogCtlData *xlogctl = XLogCtl;
+
+     return xlogctl->WALProhibited;
+}

What does this kind of locking achieve? It doesn't protect against
concurrent ALTER SYSTEM SET READ ONLY or such?

The 0004 patch improves that.

+             /*
+              * If the server is in WAL-Prohibited state then don't do anything until
+              * someone wakes us up. E.g. a backend might later on request us to put
+              * the system back to read-write.
+              */
+             if (IsWALProhibited())
+             {
+                     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+                                                      WAIT_EVENT_CHECKPOINTER_MAIN);
+                     continue;
+             }
+
/*
* Detect a pending checkpoint request by checking whether the flags
* word in shared memory is nonzero.  We shouldn't need to acquire the

So if the ASRO happens while a checkpoint is in progress, potentially with a
checkpoint_timeout of 60d, it'll not take effect until the checkpoint has
finished.

But uh, as far as I can tell, the code would simply continue an
in-progress checkpoint, despite having absorbed the barrier. And then
we'd PANIC when doing the XLogInsert()?

I think this might not be the case with the next checkpointer changes in the
0004 patch.

diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..619c33cd780
--- /dev/null
+++ b/src/include/access/walprohibit.h

Not sure I like the mix of xlog/wal prefix for pretty closely related
files... I'm not convinced it's worth having a separate file for this,
fwiw.

I see.

From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 May 2020 06:39:43 -0400
Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
READ-WRITE

Till the previous commit, the backend used to do this, but now the backend
requests the checkpointer to do it. The checkpointer, noticing that the current
state has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier
request, and then acknowledges back to the backend that requested the state change.

Note that this commit also enables ALTER SYSTEM READ WRITE support and makes the
WAL-prohibited state persistent across system restarts.

The split between the previous commit and this commit seems more
confusing than useful to me.

Looking at the previous two review comments, I agree with you. My
intention was to make things easier for the reviewer. Will merge this patch
with the previous one.

+/*
+ * WALProhibitedRequest: Request checkpointer to make the WALProhibitState to
+ * read-only.
+ */
+void
+WALProhibitRequest(void)
+{
+     /* Must not be called from checkpointer */
+     Assert(!AmCheckpointerProcess());
+     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+     /*
+      * If in a standalone backend, just do it ourselves.
+      */
+     if (!IsPostmasterEnvironment)
+     {
+             performWALProhibitStateChange(GetWALProhibitState());
+             return;
+     }
+
+     if (CheckpointerShmem->checkpointer_pid == 0)
+             elog(ERROR, "checkpointer is not running");
+
+     if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+             elog(ERROR, "could not signal checkpointer: %m");
+
+     /* Wait for the state to change to read-only */
+     ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
+     for (;;)
+     {
+             /*  We'll be done once in-progress flag bit is cleared */
+             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+                     break;
+
+             elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+             ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
+                                                        WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+     }
+     ConditionVariableCancelSleep();
+     elog(DEBUG1, "Done WALProhibitRequest");
+}

Isn't it possible that the system could have been changed back to be
read-write by the time the wakeup is being processed?

You have a point: the second backend will see the ASRW as executed successfully
despite not having made any change itself. I think it's better to raise an error
for the second backend instead of being silent. Will do the same.
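
A minimal sketch of such an error -- the wording and the errcode choice here
are mine; GetWALProhibitState() and the flag are the patch's:

uint32		state = GetWALProhibitState();

if (state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
	ereport(ERROR,
			(errcode(ERRCODE_OBJECT_IN_USE),
			 errmsg("a WAL prohibit state transition is already in progress")));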

From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:10:55 -0400
Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
write

Isn't that the wrong order? This needs to come before the feature is
enabled, no?

Agreed, but IMHO let it be; my intention behind the split is to make the code
easy to read, and I don't think these patches are going to be checked in
separately, except 0001.

@@ -758,6 +759,9 @@ brinbuildempty(Relation index)
ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);

+     /* Building indexes will have an XID */
+     AssertWALPermitted_HaveXID();
+

Ugh, that's a pretty ugly naming scheme mix.

Ok.

@@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
brin_can_do_samepage_update(oldbuf, origsz, newsz))
{
+             /* Can reach here from VACUUM, so need not have an XID */
+             if (RelationNeedsWAL(idxrel))
+                     CheckWALPermitted();
+

Hm. Maybe I am confused, but why is that dependent on
RelationNeedsWAL()? Shouldn't readonly actually mean readonly, even if
no WAL is emitted?

To avoid an unnecessary error in the case where the WAL record will not be
generated.

#include "access/genam.h"
#include "access/gist_private.h"
#include "access/transam.h"
+#include "access/walprohibit.h"
#include "commands/vacuum.h"
#include "lib/integerset.h"
#include "miscadmin.h"

The number of places that now need this new header - pretty much the
same set of files that do XLogInsert, already requiring an xlog* header
to be included - drives me further towards the conclusion that it's not
a good idea to have it separate.

Noted.

extern void ProcessInterrupts(void);

+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+     WALPERMIT_UNCHECKED,
+     WALPERMIT_CHECKED,
+     WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, marked checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+     walpermit_checked_state = CritSectionCount ? \
+     WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+

Why are these in headers? And why is this tied to CritSectionCount?

If it is too bad, we could think about moving that. Inside a critical section we
don't want the walpermit_checked_state flag to be reset by XLogResetInsertion();
otherwise a following XLogBeginInsert() would hit the assertion. The idea is that
anything that checks the flag changes it from UNCHECKED to CHECKED.
XLogResetInsertion() sets it to CHECKED_AND_USED if in a critical section and to
UNCHECKED otherwise (i.e. when CritSectionCount == 0).
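
For illustration only, here is a self-contained toy model of those transitions;
the lowercase names are stand-ins invented for this sketch, and only the three
state names mirror the patch:

#include <assert.h>

typedef enum
{
	UNCHECKED,
	CHECKED,
	CHECKED_AND_USED
} PermitCheckState;

static PermitCheckState permit_state = UNCHECKED;
static int	crit_section_count = 0;

static void check_wal_permitted(void) { permit_state = CHECKED; }
static void start_crit_section(void) { crit_section_count++; }
static void end_crit_section(void) { crit_section_count--; }

/* Stand-in for the patch's RESET_WALPERMIT_CHECKED_STATE() macro. */
static void
reset_permit_checked_state(void)
{
	permit_state = crit_section_count ? CHECKED_AND_USED : UNCHECKED;
}

/* Stand-in for XLogBeginInsert(): a permit check must already have happened. */
static void
xlog_begin_insert(void)
{
	assert(permit_state != UNCHECKED);
}

int
main(void)
{
	check_wal_permitted();			/* UNCHECKED -> CHECKED */
	start_crit_section();
	xlog_begin_insert();			/* passes: the state is CHECKED */
	reset_permit_checked_state();	/* inside a critical section -> CHECKED_AND_USED */
	end_crit_section();
	reset_permit_checked_state();	/* outside -> back to UNCHECKED */
	return 0;
}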

Regards,
Amul

[1] /messages/by-id/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com

#48Robert Haas
robertmhaas@gmail.com
In reply to: Soumyadeep Chakraborty (#36)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

This doesn't really work because of the considerations mentioned in
/messages/by-id/CA+TgmoakCtzOZr0XEqaLFiMBcjE2rGcBAzf4EybpXjtNetpSVw@mail.gmail.com

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

It's basically impossible to create a system for fast failover that
involves a checkpoint. See my comments at
/messages/by-id/CA+TgmoYe8uCgtYFGfnv3vWpZTygsdkSu2F4MNiqhkar_UKbWfQ@mail.gmail.com
- you can't achieve five nines or even four nines of availability if
you have to wait for a checkpoint that might take twenty minutes. I
have nothing against a feature that does what you're describing, but
this feature is designed to make fast failover easier to accomplish,
and it's not going to succeed if it involves a checkpoint.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

There are a few things you can can imagine doing here:

1. Freeze WAL writes but allow dirty buffers to be flushed afterward.
This is the most useful thing for fast failover, I would argue,
because it's quick and the fact that some dirty buffers may not be
written doesn't matter.

2. Freeze WAL writes except a final checkpoint which will flush dirty
buffers along the way. This is like shutting the system down cleanly
and bringing it back up as a standby, except without performing a
shutdown.

3. Freeze WAL writes and write out all dirty buffers without actually
checkpointing. This is sort of a hybrid of #1 and #2. It's probably
not much faster than #2 but it avoids generating any more WAL.

4. Freeze WAL writes and just keep all the dirty buffers cached,
without writing them out. This seems like a bad idea for the reasons
mentioned in Amul's reply. The system might not be able to respond
even to read-only queries any more if shared_buffers is full of
unevictable dirty buffers.

Either #2 or #3 is sufficient to take a filesystem level snapshot of
the cluster while it's running, but I'm not sure why that's
interesting. You can already do that sort of thing by using
pg_basebackup or by running pg_start_backup() and pg_stop_backup() and
copying the directory in the middle, and you can do all of that while
the cluster is accepting writes, which seems like it will usually be
more convenient. If you do want this, you have several options, like
running a checkpoint immediately followed by ALTER SYSTEM READ ONLY
(so that the amount of WAL generated during the backup is small but
maybe not none); or shutting down the system cleanly and restarting it
as a standby; or maybe using the proposed pg_ctl demote feature
mentioned on a separate thread.

Contrary to what you write, I don't think either #2 or #3 is
sufficient to enable checksums, at least not without some more
engineering, because the server would cache the state from the control
file, and a bunch of blocks from the database. I guess it would work
if you did a server restart afterward, but I think there are better
ways of supporting online checksum enabling that don't require
shutting down the server, or even making it read-only; and there's
been significant work done on those already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#49Robert Haas
robertmhaas@gmail.com
In reply to: Soumyadeep Chakraborty (#42)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

In the read-only level I was suggesting, I wasn't suggesting that we
stop WAL flushes, in fact we should flush the WAL before we mark the
system as read-only. Once the system declares itself as read-only, it
will not perform any more on-disk changes; It may perform all the
flushes it needs as a part of the read-only request handling.

I think that's already how the patch works, or at least how it should
work. You stop new writes, flush any existing WAL, and then declare
the system read-only. That can all be done quickly.
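
Purely as an illustration of that ordering (not code from the patch), the
checkpointer-side state change could look roughly like the sketch below; the
barrier type, SetWALProhibitState() and the CV field come from the patch, while
the other calls are existing primitives:

/* 1. Stop new writes: every backend must absorb the barrier. */
WaitForProcSignalBarrier(
	EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE));

/* 2. Flush whatever WAL was inserted before the barrier. */
XLogFlush(GetXLogInsertRecPtr());

/* 3. Only now declare the system read-only and wake up any waiters. */
SetWALProhibitState(WALPROHIBIT_STATE_READ_ONLY);
ConditionVariableBroadcast(&CheckpointerShmem->walprohibit_cv);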

What I am saying is it doesn't have to be just the queries. I think we
can cater to all the other use cases simply by forcing a checkpoint
before marking the system as read-only.

But that part can't, which means that if we did that, it would break
the feature for the originally intended use case. I'm not on board
with that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#45)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 10:04 PM Andres Freund <andres@anarazel.de> wrote:

I think we should consider introducing XACTFATAL or such, guaranteeing
the transaction gets aborted, without requiring a FATAL. This has been
needed for enough cases that it's worthwhile.

Seems like that would need a separate discussion, apart from this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#51Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#46)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 23, 2020 at 10:14 PM Amul Sul <sulamul@gmail.com> wrote:

On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:

In case it is necessary, the patch set does not wait for the checkpoint to
complete before marking the system as read-write. Refer:

/* Set final state by clearing in-progress flag bit */
if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
{
	if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
		ereport(LOG, (errmsg("system is now read only")));
	else
	{
		/* Request checkpoint */
		RequestCheckpoint(CHECKPOINT_IMMEDIATE);
		ereport(LOG, (errmsg("system is now read write")));
	}
}

We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
we SetWALProhibitState() and do the ereport(), if we have a read-write
state change request.

+1, I too have the same question.

FWIW, I don't think we can request CHECKPOINT_WAIT at this place; otherwise, I
think it will be a deadlock case -- the checkpointer process waiting for itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().
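
A rough sketch of that suggestion, following the shape of the quoted snippet;
CreateCheckPoint() and the CHECKPOINT_* flags exist in core, while the
WALPROHIBIT_* names are the patch's:

if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) == 0)
{
	/* Going read-write: run the checkpoint here, in the checkpointer. */
	CreateCheckPoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE);

	/* Only afterwards clear the in-progress bit and report success. */
	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
		ereport(LOG, (errmsg("system is now read write")));
}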

3. Some of the functions that were added such as GetWALProhibitState(),
IsWALProhibited() etc could be declared static inline.

IsWALProhibited() can be static, but not GetWALProhibitState(), since it
needs to be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.
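
A sketch of that shape in access/walprohibit.h; the body is deliberately
simplified, and only the state accessor and flag name are the patch's:

/* in access/walprohibit.h */
static inline bool
IsWALProhibited(void)
{
	uint32		state = GetWALProhibitState();

	return (state & WALPROHIBIT_STATE_READ_ONLY) != 0;
}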

Regards,
Soumyadeep

#52Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#48)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 24, 2020 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

This doesn't really work because of the considerations mentioned in
/messages/by-id/CA+TgmoakCtzOZr0XEqaLFiMBcjE2rGcBAzf4EybpXjtNetpSVw@mail.gmail.com

Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
think we should say "READ ONLY" if we still allow on-disk file changes
after the ALTER SYSTEM command returns (courtesy of dirty buffer flushes)
because it does introduce confusion, especially to an audience not privy
to this thread. When people hear "read-only" they may think of static on-disk
files immediately.

Contrary to what you write, I don't think either #2 or #3 is
sufficient to enable checksums, at least not without some more
engineering, because the server would cache the state from the control
file, and a bunch of blocks from the database. I guess it would work
if you did a server restart afterward, but I think there are better
ways of supporting online checksum enabling that don't require
shutting down the server, or even making it read-only; and there's
been significant work done on those already.

Agreed. As you mentioned, if we did do #2 or #3, we would be able to do
pg_checksums on a server that was shut down or that had crashed while it
was in a read-only state, which is what Michael was asking for in [1]. I
think it's just cleaner if we allow for this.

I don't have enough context to enumerate use cases for the advantages or
opportunities that would come with an assurance that the cluster's files
are frozen (and not covered by any existing utilities), but surely there
are some? Like the possibility of pg_upgrade on a running server while
it can entertain read-only queries? Surely, that's a nice one!

Of course, some or all of these utilities would need to be taught about
read-only mode.

Regards,
Soumyadeep

[1]: /messages/by-id/20200626095921.GF1504@paquier.xyz

#53Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#49)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 24, 2020 at 7:34 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

In the read-only level I was suggesting, I wasn't suggesting that we
stop WAL flushes, in fact we should flush the WAL before we mark the
system as read-only. Once the system declares itself as read-only, it
will not perform any more on-disk changes; It may perform all the
flushes it needs as a part of the read-only request handling.

I think that's already how the patch works, or at least how it should
work. You stop new writes, flush any existing WAL, and then declare
the system read-only. That can all be done quickly.

True, except for the fact that it allows dirty buffers to be flushed
after the ALTER command returns.

What I am saying is it doesn't have to be just the queries. I think we
can cater to all the other use cases simply by forcing a checkpoint
before marking the system as read-only.

But that part can't, which means that if we did that, it would break
the feature for the originally intended use case. I'm not on board
with that.

Referring to the options you presented in [1]:
I am saying that we should allow for both: with a checkpoint (#2) (can
also be #3) and without a checkpoint (#1) before having the ALTER
command return, by having different levels of read-onlyness.

We should have syntax variants for these. The syntax should not be an
ALTER SYSTEM SET as you have pointed out before. Perhaps:

ALTER SYSTEM READ ONLY; -- #2 or #3
ALTER SYSTEM READ ONLY WAL; -- #1
ALTER SYSTEM READ WRITE;

or even:

ALTER SYSTEM FREEZE; -- #2 or #3
ALTER SYSTEM FREEZE WAL; -- #1
ALTER SYSTEM UNFREEZE;

Regards,
Soumyadeep (VMware)

[1]: /messages/by-id/CA+TgmoZ-c3Dz9QwHwmm4bc36N4u0XZ2OyENewMf+BwokbYdK9Q@mail.gmail.com

#54Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#50)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

The attached version is updated w.r.t. some of the review comments
from Soumyadeep and Andres.

Two things from Andres' review comments are not addressed:

1. Only a superuser is allowed to execute AlterSystemSetWALProhibitState().
As per Andres, we should instead do this in a GRANTable manner. I tried that
but got a little confused about which roles we could use for ASRO and didn't
see a very appropriate one. pg_signal_backend could have suited ASRO, where we
terminate some of the backends, but a user granted this role is not supposed
to terminate a superuser backend. If we used that, we would need to check for
a superuser backend and raise an error or warning. Other roles are
pg_write_server_files or pg_execute_server_program, but I am not sure we
should use either of these; it seems a bit confusing to me. Any suggestion, or
am I missing something here?
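
For reference, the GRANTable form would presumably boil down to something like
the check below; the predefined role pg_alter_system_read_only and its OID macro
are hypothetical here, while has_privs_of_role() and GetUserId() are existing
backend functions:

/* hypothetical predefined role, named here only for illustration */
if (!has_privs_of_role(GetUserId(), DEFAULT_ROLE_ALTER_SYSTEM_READ_ONLY))
	ereport(ERROR,
			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
			 errmsg("must be a member of pg_alter_system_read_only "
					"to execute ALTER SYSTEM READ ONLY/WRITE")));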

2. About the walprohibit.c/.h files: Andres' concern about the file name is
that WAL-related file names start with xlog. I think renaming to xlog* would
not be correct and would be more confusing, since the functions/variables/macros
inside walprohibit.c/.h contain the walprohibit keyword. Another concern is
that, because of the separate file, we have to include it in many places, but I
think that will be a one-time pain and is worth it to keep the code modularised.

Andres, Robert, do let me know your opinion on this; if you think we should
merge walprohibit.c/.h into xlog.c/.h, I will do that in the next version.

Changes in the attached version are:

1. Renamed readonly_cv to walprohibit_cv.
2. Removed repetitive comments for CheckWALPermitted() &
AssertWALPermitted_HaveXID().
3. Renamed AssertWALPermitted_HaveXID() to AssertWALPermittedHaveXID().
4. Changes to avoid repeated RelationNeedsWAL() calls.
5. IsWALProhibited() made static inline function.
6. Added outfuncs and readfuncs functions.
7. Added an error for when a read-only state transition is in progress and
another backend tries to make the system read-write, or vice versa.
Previously, the second backend would see the command as having executed
successfully when it hadn't.
8. Merged checkpointer code changes patch to 0002.

Regards,
Amul

Attachments:

v4-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/x-patch)
From c8d1f8abbc521442de2852c9d693e0a7a8477f7d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v4 1/5] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ece..13648887187 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0

v4-0002-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From 92bc6df6f0aff4274c9511c0afd94d2759be82a5 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v4 2/5] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de664..ba3393b8ccf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(WALProhibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5406,6 +5415,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e3f33c40be5..b09bff458af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(WALProhibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3458,6 +3464,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515da..37f297f39a5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1358,6 +1358,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(WALProhibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3914,6 +3923,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..0ac826d3c2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(WALProhibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2874,6 +2887,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dbb47d49829..6090d18ec61 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10172,8 +10173,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->WALProhibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9b0c376c8cb..7af96c77082 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2772,6 +2779,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3636,3 +3644,15 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 8b735476ade..65d1487f80f 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1858,9 +1858,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 151bcdb7ef5..f2c1ae8e3fe 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3194,6 +3194,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7eaaad1e140..247bdf1bacc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0

v4-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/x-patch)
From 15aabceff6dcffce60300623fed48ce3a2fced25 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v4 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Add the Assert or the Error based on the following criteria when the system is
WAL-prohibited:

 - Added ERROR for functions that can be reached without a valid XID, in the
   case of VACUUM or CONCURRENT CREATE INDEX.  For that, added the common
   static inline function CheckWALPermitted().
 - Added Assert for functions that cannot be reached without a valid XID; the
   Assert also ensures XID validity.  For that, added AssertWALPermittedHaveXID().

To enforce the rule of having the aforesaid assert or error check before
entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added.  If this check is missing, XLogBeginInsert()
will hit an assertion if it is in a critical section.

If we are not doing the WAL insert inside a critical section, then the above
check is not necessary; we can rely on XLogBeginInsert() for that check and
report an error.
---
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 +++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 +++++-
 src/backend/access/hash/hash.c            | 19 +++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++--
 src/backend/access/hash/hashpage.c        |  9 ++++
 src/backend/access/heap/heapam.c          | 26 +++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++
 src/backend/access/spgist/spgvacuum.c     | 21 ++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 27 ++++++++----
 src/backend/access/transam/xloginsert.c   | 13 +++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 ++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/storage/lmgr/lock.c           |  6 +--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 50 ++++++++++++++++++++++-
 src/include/miscadmin.h                   | 27 ++++++++++++
 40 files changed, 502 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5ee0c..3ea57bbb4bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -758,6 +759,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index e8b8308f82e..242c61b1f1e 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -333,6 +334,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -397,6 +399,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -408,7 +414,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -606,6 +612,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8d08b05f515..0d9997463b4 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -333,6 +334,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -378,6 +380,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -386,10 +389,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -410,7 +416,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -548,6 +554,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -588,7 +597,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds are run in a transaction that has an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1fedbc..1a7d777cc42 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..a8443e73cd5 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 79fe6eb8d62..fca136b0d41 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -134,6 +135,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds are run in a transaction that has an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -233,6 +237,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -464,9 +469,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -499,7 +507,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -525,6 +533,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -566,7 +577,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1640,6 +1651,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1658,13 +1670,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1681,7 +1696,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3ec6d528e77..b7707f90aa1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -466,6 +467,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -572,6 +574,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -602,7 +608,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -689,6 +695,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -787,6 +794,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -808,7 +818,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -882,6 +892,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -889,7 +902,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8df2716de46..cccd0ca891d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1881,6 +1882,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2154,6 +2157,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2672,6 +2677,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3424,6 +3431,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3597,6 +3606,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4530,6 +4541,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5321,6 +5334,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5479,6 +5494,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5587,6 +5604,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5703,6 +5722,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
@@ -5733,6 +5753,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5743,7 +5767,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 256df4de105..0899078a1e2 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -77,11 +78,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	TransactionId OldestXmin;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * If WAL insertion isn't allowed, there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -185,6 +186,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -225,6 +227,10 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 									 &prstate);
 	}
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -258,7 +264,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f75..89ab313c46b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -764,6 +765,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1203,6 +1205,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1218,7 +1223,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1471,6 +1476,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1488,7 +1496,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1910,6 +1918,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1917,6 +1926,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1942,7 +1954,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40d..9606e1752d9 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * The startup process never runs with WAL prohibited during recovery, so
+	 * an assert-only check suffices when we get here from it.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3a44bc09e0..33cc89ac392 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1900,13 +1903,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 70bac0052fc..295aa1577f1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1831,6 +1848,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1919,6 +1937,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1970,7 +1992,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2062,6 +2084,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2275,6 +2298,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2346,7 +2373,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..5ed12301763 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -501,10 +512,14 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemToPlaceholder[MaxIndexTuplesPerPage];
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
+	bool		needwal = RelationNeedsWAL(index);
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -580,7 +595,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ce84dac0c40..20005798e9a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2942,7 +2945,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0ec..141089d977a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1112,6 +1113,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2204,6 +2207,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2294,6 +2300,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e3..365de44321d 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -73,6 +74,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* Cannot assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextFullXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index a3f1a750744..6f66804735f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -16,6 +16,16 @@
 #include "postmaster/bgwriter.h"
 #include "storage/procsignal.h"
 
+/*
+ * Assertion state used to enforce the rule that WAL insert permission must be
+ * checked before starting a critical section that writes WAL.  One of
+ * CheckWALPermitted, AssertWALPermittedHaveXID, or AssertWALPermitted must be
+ * called before entering the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * ProcessBarrierWALProhibit()
  *
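
To make the per-AM hunks easier to follow, the coding rule described above
boils down to roughly the following shape at each call site (an illustrative
sketch only, not part of the patch; the record type, registered buffers and
the exact needwal condition vary per caller, and the usual bufmgr.h,
xloginsert.h and walprohibit.h includes are assumed):

    static void
    example_wal_logged_change(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /* buf is assumed pinned and exclusively locked by the caller */

        /* The permission check happens before the critical section ... */
        if (needwal)
            CheckWALPermitted();    /* may ereport(ERROR); still safe here */

        START_CRIT_SECTION();

        /* ... scribble on the page ... */
        MarkBufferDirty(buf);

        if (needwal)
        {
            XLogRecPtr  recptr;

            /* a full-page image stands in for whatever record the AM emits */
            XLogBeginInsert();
            XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
            recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
            PageSetLSN(BufferGetPage(buf), recptr);
        }

        END_CRIT_SECTION();
    }
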
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8ebc4242c25..8f12e9f672b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1304,6 +1305,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1664,6 +1667,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only get here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6d33ea226b1..e3923ea1fe9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1028,7 +1028,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2863,9 +2863,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not block WAL flushing; otherwise a dirty buffer
+	 * could never be evicted, since WAL must first be flushed up to its LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8855,6 +8857,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8884,6 +8888,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9112,6 +9118,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9269,6 +9277,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9915,7 +9925,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9929,10 +9939,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9954,8 +9964,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* WAL permission is known to have been checked by now */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c526bb19281..506d7e97f38 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering a critical
+	 * section; otherwise the WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
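
Tying the two xloginsert.c hunks together: the check/assert macros (defined
in walprohibit.h, not shown in this excerpt) presumably record that a check
was made, XLogBeginInsert() asserts that fact whenever it is called inside a
critical section, and XLogResetInsertion() -- which XLogInsert() finishes
with -- clears it again, so every record needs its own preceding check. A
compressed sketch of that life cycle, again illustration only:

    CheckWALPermitted();        /* ERRORs if WAL is prohibited; otherwise
                                 * records that the check was made */
    START_CRIT_SECTION();
    XLogBeginInsert();          /* the new Assert passes: the state was set
                                 * before entering the critical section */
    XLogRegisterData((char *) "noop", 5);
    (void) XLogInsert(RM_XLOG_ID, XLOG_NOOP);   /* ends in XLogResetInsertion(),
                                                 * clearing the state again */
    END_CRIT_SECTION();
    /* the next record must be preceded by a fresh check */
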
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..8dacf48db24 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e8aa9a0ec3..5404b7bc126 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f1ae6f9f844..322b7a385cf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3638,13 +3638,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 95a21f6cc38..45d7fdf9485 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79bd..212312d5ae5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 163fe0d2fce..3442df5be2f 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -19,8 +19,8 @@ extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
@@ -29,4 +29,50 @@ extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
  */
 #define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
 
+/* Never reaches when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertions above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif		/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 18bc8a7b904..63459305383 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section; otherwise, mark it checked-and-used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0
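
For readers skimming the diffs above, the coding rule being enforced looks
roughly like the following sketch.  The function and its arguments are
hypothetical; CheckWALPermitted() is from this patch set, the rest are existing
core routines, and the shape follows the freespace.c hunk above:

	/*
	 * Hypothetical caller of the new rule: check WAL permission *before* the
	 * critical section, so a WAL-prohibited system raises a plain ERROR here
	 * rather than a PANIC from inside the critical section.  Assumes buf is
	 * pinned and exclusively locked.
	 */
	static void
	log_page_if_needed(Relation rel, Buffer buf)
	{
		bool		needwal = !InRecovery && RelationNeedsWAL(rel);

		if (needwal)
			CheckWALPermitted();	/* ERROR if the system is read only */

		START_CRIT_SECTION();
		MarkBufferDirty(buf);
		if (needwal)
			(void) log_newpage_buffer(buf, false);	/* safe: permission checked */
		END_CRIT_SECTION();
	}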

v4-0005-Documentation-WIP.patchapplication/x-patch; name=v4-0005-Documentation-WIP.patchDownload
From a8ea284ce517316a8fbae5c88cf8b2100b61e7be Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v4 5/5] Documentation - WIP

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 59 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 60 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..f62929f1660 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -433,8 +433,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -477,6 +477,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is read only when it is not currently possible to insert write-ahead
+log records, either because the system is still in recovery or because it has
+been forced into the WAL prohibited state by ALTER SYSTEM READ ONLY.  We have a
+lower-level defense in XLogBeginInsert() and elsewhere to stop us from modifying
+data when !XLogInsertAllowed(), but if XLogBeginInsert() is reached inside a
+critical section we must not depend on it to report an error, because an error
+there will escalate to PANIC, as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt must
+stop WAL writing immediately.  While absorbing the barrier, a backend whose
+transaction has a valid XID is killed, because a valid XID indicates that the
+transaction has performed, or is planning, a WAL write.  Transactions that have
+not acquired an XID yet, and operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that do not necessarily have an XID when writing WAL, are not
+stopped by the barrier, so they might hit the error from XLogBeginInsert() when
+trying to write WAL in the read only state.  To prevent that error from being
+raised inside a critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that records whether
+permission has been checked before XLogBeginInsert() is called.  If it has not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the flag, call CheckWALPermitted(),
+AssertWALPermittedHaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among the permission check routines
+are:
+
+	Places where a WAL write inside a critical section can be reached without a
+	valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error can be reported before entering the critical section.
+
+	Places reached only by INSERT, UPDATE, or DELETE, which never run without a
+	valid XID, can use AssertWALPermittedHaveXID(), so that non-assert builds
+	incur no checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but should still verify on assert-enabled builds
+	that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -522,7 +570,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it within a critical section.
 
 void XLogResetInsertion(void)
 
@@ -630,8 +679,8 @@ If the buffer is clean and checksums are in use then
 MarkBufferDirtyHint() inserts an XLOG_FPI record to ensure that we
 take a full page image that includes the hint. We do this to avoid
 a partial page write, when we write the dirtied page. WAL is not
-written during recovery, so we simply skip dirtying blocks because
-of hints when in recovery.
+written while the system is read only (i.e. during recovery or in the WAL
+prohibit state), so we simply skip dirtying blocks because of hints then.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0
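
In short, the assert-build bookkeeping described in the README hunk above
behaves like the following conceptual sketch (the calls and macros are the ones
added by these patches; the sequence, not a complete function, is what matters):

	CheckWALPermitted();	/* ERROR if read only; otherwise sets
							 * walpermit_checked_state = WALPERMIT_CHECKED */
	START_CRIT_SECTION();
	XLogBeginInsert();		/* Assert(walpermit_checked_state !=
							 * WALPERMIT_UNCHECKED || !CritSectionCount) */
	/* ... register buffers/data and call XLogInsert() as usual ... */
	END_CRIT_SECTION();		/* RESET_WALPERMIT_CHECKED_STATE() puts the flag
							 * back to WALPERMIT_UNCHECKED once the outermost
							 * critical section exits */

CHECK_FOR_INTERRUPTS() performs the same reset, so a stale "checked" flag does
not leak across statements in assert-enabled builds.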

v4-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchapplication/x-patch; name=v4-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchDownload
From 0cc6c9dc8b942222d0bf09d7b5cbd8f85c94cd30 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v4 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() raises a
    request to the checkpointer by marking the current state as in-progress in
    shared memory.  The checkpointer, noticing that the current state has the
    WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, performs the barrier request
    and then acknowledges back to the backend that requested the state change
    once the transition has been completed.  The final state is updated in the
    control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then the
    backend is killed by throwing FATAL (XXX: needs more discussion on this).

 3. Otherwise, if the backend is running a transaction which has not yet been
    assigned an XID, we don't need to do anything special; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (from an existing or a new backend) starts as a read-only
    transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything in
    the WAL-Prohibited server state until someone wakes them up, e.g. a backend
    might later request putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint and
    xlog rotation.  Starting up again will perform crash recovery (XXX: needs
    some discussion on this as well).

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby servers.

 8. Only a superuser can toggle the WAL-Prohibit state.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c |  97 +++++++++++++++
 src/backend/access/transam/xact.c        |  49 +++++---
 src/backend/access/transam/xlog.c        | 150 +++++++++++++++++++++--
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    | 117 ++++++++++++++++++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/procsignal.c     |  26 +---
 src/backend/tcop/utility.c               |  14 +--
 src/backend/utils/misc/guc.c             |  26 ++++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  32 +++++
 src/include/access/xlog.h                |   3 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 18 files changed, 466 insertions(+), 73 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..a3f1a750744
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "postmaster/bgwriter.h"
+#include "storage/procsignal.h"
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * the transaction by throwing ERROR, for the following reasons, which
+		 * still need some thought:
+		 *
+		 * 1. Due to challenges presented by the wire protocol, we could
+		 * not simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill the
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Cannot continue a transaction if it has performed writes while system is read only.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	uint32		state;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we have yet to convey this WAL prohibit state to all backends,
+	 * mark it in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return; /* server is already in the desired state */
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	WALProhibitRequest();
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29847f..8ebc4242c25 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1949,23 +1949,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
@@ -4887,9 +4892,11 @@ CommitSubTransaction(void)
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
 	 * upper is read-write while the child is read-only; GUC will incorrectly
-	 * think it should leave the child state in place.
+	 * think it should leave the child state in place.  Note that the upper
+	 * transaction will be forced to read-only irrespective of its previous
+	 * status if the server state is WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
@@ -5045,9 +5052,11 @@ AbortSubTransaction(void)
 	/*
 	 * Restore the upper transaction's read-only state, too.  This should be
 	 * redundant with GUC's cleanup but we may as well do it for consistency
-	 * with the commit case.
+	 * with the commit case.  Note that the upper transaction will be forced
+	 * to read-only irrespective of its previous status if the server state is
+	 * WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	RESUME_INTERRUPTS();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 184c6672f3b..6d33ea226b1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -245,9 +246,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -660,6 +662,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * SharedWALProhibitState indicates current WAL prohibit state.
+	 * Protected by info_lck.
+	 */
+	uint32		SharedWALProhibitState;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -970,6 +978,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -7706,6 +7715,15 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, update the WAL prohibit state in shared
+	 * memory; it decides whether further WAL inserts are allowed or not.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = ControlFile->wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7716,7 +7734,15 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7962,6 +7988,83 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Atomically return the current server WAL prohibited state */
+uint32
+GetWALProhibitState(void)
+{
+	uint32		state;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	state = XLogCtl->SharedWALProhibitState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return state;
+}
+
+/*
+ * SetWALProhibitState: Change current wal prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is the same as the current state, return
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	/* Server is already in requested state */
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Prevent concurrently setting a contrary in-progress transition state */
+	if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+		(cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read only is already in progress"),
+					 errhint("Try again after some time.")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read write is already in progress"),
+					 errhint("Try again after some time.")));
+
+	}
+
+	/* Update the new state in shared memory */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = new_state;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Update control file if it is the final state */
+	if (!(new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		bool	wal_prohibited = (new_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->wal_prohibited = wal_prohibited;
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	return true;
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8177,9 +8280,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8193,14 +8296,25 @@ XLogInsertAllowed(void)
 		return (bool) LocalXLogInsertAllowed;
 
 	/*
-	 * Else, must check to see if we're still in recovery.
+	 * Else, must check to see if we're still in recovery
 	 */
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8216,12 +8330,19 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
+
 	LocalXLogInsertAllowed = 1;
 
 	/* Initialize as RecoveryInProgress() would do when switching state */
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8513,7 +8634,10 @@ ShutdownXLOG(int code, Datum arg)
 
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	/*
+	 * Can't perform checkpoint or xlog rotation without writing WAL.
+	 */
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8526,6 +8650,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..f83f86994db 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427f..6c6ff7dc3af 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -268,7 +268,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b80..3e8aa9a0ec3 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -127,6 +128,9 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
+	ConditionVariable walprohibit_cv; /* signaled when requested wal
+										 prohibit state changes */
+
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
@@ -168,6 +172,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void performWALProhibitStateChange(uint32 wal_state);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -332,6 +337,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -342,6 +348,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			performWALProhibitStateChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example a
+			 * backend might later on request us to put the system back to the
+			 * read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -879,6 +907,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->walprohibit_cv);
 	}
 }
 
@@ -1109,6 +1138,94 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	return true;
 }
 
+/*
+ * WALProhibitRequest: Request the checkpointer to complete the pending WAL
+ * prohibit state transition, and wait until it has done so.
+ */
+void
+WALProhibitRequest(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		performWALProhibitStateChange(GetWALProhibitState());
+		return;
+	}
+
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&CheckpointerShmem->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once the in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+		ConditionVariableSleep(&CheckpointerShmem->walprohibit_cv,
+							   WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+	elog(DEBUG1, "Done WALProhibitRequest");
+}
+
+/*
+ * performWALProhibitStateChange: checkpointer will call this to complete
+ * the requested WAL prohibit state transition.
+ */
+static void
+performWALProhibitStateChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/* Request checkpoint */
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&CheckpointerShmem->walprohibit_cv);
+}
+
 /*
  * CompactCheckpointerRequestQueue
  *		Remove duplicates from the request queue to avoid backend fsyncs.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2da2c..0eb40a86b52 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4057,6 +4057,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 13648887187..b973727a580 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 7af96c77082..d6411e4f3e9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3644,15 +3644,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	/* some code */
-	elog(INFO, "AlterSystemSetWALProhibitState() called");
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index abfa95a2314..a2f796164a7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12019,4 +12033,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df744..9594df76946 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..163fe0d2fce
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,32 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * The bit is used in state transition from one state to another.  When this
+ * bit is set then the state indicated by the 0th position bit is yet to
+ * confirmed.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
+#endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 219a7299e1f..5f5c2146e7e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -325,6 +326,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(uint32 new_state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e5382..b32c7723275 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* If true, WAL inserts are prohibited (the system is read only). */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 13872013823..780c59f3e48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -955,6 +955,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..e8271b49f6d 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -35,6 +35,8 @@ extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
 
+extern void WALProhibitRequest(void);
+
 extern void AbsorbSyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
-- 
2.22.0
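
As a quick orientation before the review below: once the system is WAL
prohibited, background processes simply test XLogInsertAllowed() and skip any
WAL-writing work, as the autovacuum launcher hunk in this patch does.  A
hypothetical worker loop following the same pattern (sketch only; the loop body
is made up, the calls are existing core APIs):

	for (;;)
	{
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L /* ms */, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();

		/* Read only (recovery or ALTER SYSTEM READ ONLY): nothing to do */
		if (!XLogInsertAllowed())
			continue;

		/* ... WAL-writing work goes here ... */
	}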

#55Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#51)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 24, 2020 at 10:40 PM Soumyadeep Chakraborty <
soumyadeep2007@gmail.com> wrote:

On Thu, Jul 23, 2020 at 10:14 PM Amul Sul <sulamul@gmail.com> wrote:

On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <

soumyadeep2007@gmail.com> wrote:

In case it is necessary, the patch set does not wait for the checkpoint to
complete before marking the system as read-write. Refer:

	/* Set final state by clearing in-progress flag bit */
	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
	{
		if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
			ereport(LOG, (errmsg("system is now read only")));
		else
		{
			/* Request checkpoint */
			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
			ereport(LOG, (errmsg("system is now read write")));
		}
	}

We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
we SetWALProhibitState() and do the ereport(), if we have a read-write
state change request.

+1, I too have the same question.

FWIW, I don't think we can request CHECKPOINT_WAIT for this place; otherwise,
I think it will be a deadlock case -- the checkpointer process waiting for
itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().

Only setting the flag would have been enough for now; the next loop of
CheckpointerMain() will anyway call CreateCheckPoint() without waiting.  I used
RequestCheckpoint() to avoid duplicating the flag-setting code.  Also, I think
RequestCheckpoint() will be better so that we don't need to deal with the
standalone backend; the only imperfection is that it will unnecessarily signal
itself, which should be fine I guess.

3. Some of the functions that were added such as GetWALProhibitState(),

IsWALProhibited() etc could be declared static inline.

IsWALProhibited() can be static but not GetWALProhibitState(), since it needs
to be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.

Well, the current patch set also has a few inline functions in the header file.
But I don't think we can do the same for GetWALProhibitState() without changing
the scope of the XLogCtl structure, which is local to the xlog.c file, and
changing the XLogCtl scope would be a bad idea.
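
To spell that out with a small example: a header static inline is fine as long
as everything it touches is visible to the includer, e.g. something like this
hypothetical helper that only looks at the value passed to it:

	static inline bool
	WALProhibitStateIsReadOnly(uint32 state)
	{
		return (state & WALPROHIBIT_STATE_READ_ONLY) != 0;
	}

whereas GetWALProhibitState() reads XLogCtl->SharedWALProhibitState under
XLogCtl->info_lck, and XLogCtl is a static variable private to xlog.c, so that
accessor has to remain an exported function defined in xlog.c.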

Regards,
Amul

#56Robert Haas
robertmhaas@gmail.com
In reply to: Soumyadeep Chakraborty (#52)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 24, 2020 at 3:12 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
think we should say "READ ONLY" if we still allow on-disk file changes
after the ALTER SYSTEM command returns (courtesy dirty buffer flushes)
because it does introduce confusion, especially to an audience not privy
to this thread. When people hear "read-only" they may think of static on-disk
files immediately.

They might think of a variety of things that are not a correct
interpretation of what the feature does, but I think the way to handle
that is to document it properly. I don't think making WAL a grammar
keyword just for this is a good idea. I'm not totally stuck on this
particular syntax if there's consensus on something else, but I
seriously doubt that there will be consensus around adding parser
keywords for this.

I don't have enough context to enumerate use cases for the advantages or
opportunities that would come with an assurance that the cluster's files
are frozen (and not covered by any existing utilities), but surely there
are some? Like the possibility of pg_upgrade on a running server while
it can entertain read-only queries? Surely, that's a nice one!

I think that this feature is plenty complicated enough already, and we
shouldn't make it more complicated to cater to additional use cases,
especially when those use cases are somewhat uncertain and would
probably require additional work in other parts of the system.

For instance, I think it would be great to have an option to start the
postmaster in a strictly "don't write ANYTHING" mode where regardless
of the cluster state it won't write any data files or any WAL or even
the control file. It would be useful for poking around on damaged
clusters without making things worse. And it's somewhat related to the
topic of this thread, but it's not THAT closely related. It's better
to add features one at a time; you can always add more later, but if
you make the individual ones too big and hard they don't get done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#57Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#56)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is a rebased on top of the latest master head (# 3e98c0bafb2).

Regards,
Amul

Attachments:

v5-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchapplication/x-patch; name=v5-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchDownload
From 8629758383237c7ed1249fa6038c15fcd34f9ddd Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v5 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() raises a
    request to the checkpointer by marking the current state as in-progress in
    shared memory.  The checkpointer, noticing that the current state has the
    WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, performs the barrier request
    and then acknowledges back to the backend that requested the state change
    once the transition has been completed.  The final state is updated in the
    control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then the
    backend is killed by throwing FATAL (XXX: needs more discussion on this).

 3. Otherwise, if the backend is running a transaction which has not yet been
    assigned an XID, we don't need to do anything special; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (from an existing or a new backend) starts as a read-only
    transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything in
    the WAL-Prohibited server state until someone wakes them up, e.g. a backend
    might later request putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint and
    xlog rotation.  Starting up again will perform crash recovery (XXX: needs
    some discussion on this as well).

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby servers.

 8. Only a superuser can toggle the WAL-Prohibit state.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c |  97 +++++++++++++++
 src/backend/access/transam/xact.c        |  49 +++++---
 src/backend/access/transam/xlog.c        | 150 +++++++++++++++++++++--
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    | 117 ++++++++++++++++++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/procsignal.c     |  26 +---
 src/backend/tcop/utility.c               |  14 +--
 src/backend/utils/misc/guc.c             |  26 ++++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  32 +++++
 src/include/access/xlog.h                |   3 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 18 files changed, 466 insertions(+), 73 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..a3f1a750744
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "postmaster/bgwriter.h"
+#include "storage/procsignal.h"
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * the transaction by throwing ERROR, for the following reasons that
+		 * need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we could
+		 * not simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Cannot continue a transaction if it has performed writes while system is read only.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	uint32		state;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we have yet to convey this WAL prohibit state to all backends,
+	 * mark it in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return; /* server is already in the desired state */
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	WALProhibitRequest();
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb13..98a1943f717 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
@@ -4903,9 +4908,11 @@ CommitSubTransaction(void)
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
 	 * upper is read-write while the child is read-only; GUC will incorrectly
-	 * think it should leave the child state in place.
+	 * think it should leave the child state in place.  Note that the upper
+	 * transaction will be forced to read-only irrespective of its previous
+	 * status if the server state is WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
@@ -5064,9 +5071,11 @@ AbortSubTransaction(void)
 	/*
 	 * Restore the upper transaction's read-only state, too.  This should be
 	 * redundant with GUC's cleanup but we may as well do it for consistency
-	 * with the commit case.
+	 * with the commit case.  Note that the upper transaction will be forced
+	 * to read-only irrespective of its previous status if the server state is
+	 * WAL prohibited.
 	 */
-	XactReadOnly = s->prevXactReadOnly;
+	XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
 
 	RESUME_INTERRUPTS();
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae4..2091eff0d53 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -245,9 +246,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -657,6 +659,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * SharedWALProhibitState indicates current WAL prohibit state.
+	 * Protected by info_lck.
+	 */
+	uint32		SharedWALProhibitState;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -967,6 +975,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -7703,6 +7712,15 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, update the WAL prohibit state in shared
+	 * memory, which decides whether further WAL inserts are allowed or not.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = ControlFile->wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7713,7 +7731,15 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7959,6 +7985,83 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Atomically return the current server WAL prohibited state */
+uint32
+GetWALProhibitState(void)
+{
+	uint32		state;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	state = XLogCtl->SharedWALProhibitState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return state;
+}
+
+/*
+ * SetWALProhibitState: Change current wal prohibit state to the input state.
+ *
+ * If the server has already completed the move to the requested WAL prohibit
+ * state, or if the desired state is the same as the current state, return
+ * false, indicating that the server state did not change.  Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	/* Server is already in requested state */
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Prevent a contrary transition while another is already in progress */
+	if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+		(cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read only is already in progress"),
+					 errhint("Try again after some time.")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read write is already in progress"),
+					 errhint("Try again after some time.")));
+
+	}
+
+	/* Update new state in shared memory */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedWALProhibitState = new_state;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Update control file if it is the final state */
+	if (!(new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		bool	wal_prohibited = (new_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->wal_prohibited = wal_prohibited;
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	return true;
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8174,9 +8277,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8190,14 +8293,25 @@ XLogInsertAllowed(void)
 		return (bool) LocalXLogInsertAllowed;
 
 	/*
-	 * Else, must check to see if we're still in recovery.
+	 * Else, must check to see if we're still in recovery
 	 */
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8213,12 +8327,19 @@ static void
 LocalSetXLogInsertAllowed(void)
 {
 	Assert(LocalXLogInsertAllowed == -1);
+
 	LocalXLogInsertAllowed = 1;
 
 	/* Initialize as RecoveryInProgress() would do when switching state */
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8510,7 +8631,10 @@ ShutdownXLOG(int code, Datum arg)
 
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	/*
+	 * Can't perform checkpoint or xlog rotation without writing WAL.
+	 */
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8523,6 +8647,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c6ec657a936..93273d1e6b4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427f..6c6ff7dc3af 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -268,7 +268,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b80..3e8aa9a0ec3 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -127,6 +128,9 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
+	ConditionVariable walprohibit_cv; /* signaled when requested wal
+										 prohibit state changes */
+
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
@@ -168,6 +172,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void performWALProhibitStateChange(uint32 wal_state);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -332,6 +337,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -342,6 +348,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			performWALProhibitStateChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example a
+			 * backend might later on request us to put the system back to the
+			 * read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -879,6 +907,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->walprohibit_cv);
 	}
 }
 
@@ -1109,6 +1138,94 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	return true;
 }
 
+/*
+ * WALProhibitRequest: Request the checkpointer to complete the requested
+ * WAL prohibit state change, and wait until it has done so.
+ */
+void
+WALProhibitRequest(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		performWALProhibitStateChange(GetWALProhibitState());
+		return;
+	}
+
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&CheckpointerShmem->walprohibit_cv);
+	for (;;)
+	{
+		/*  We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+		ConditionVariableSleep(&CheckpointerShmem->walprohibit_cv,
+							   WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+	elog(DEBUG1, "Done WALProhibitRequest");
+}
+
+/*
+ * performWALProhibitStateChange: checkpointer will call this to complete
+ * the requested WAL prohibit state transition.
+ */
+static void
+performWALProhibitStateChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/* Request checkpoint */
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&CheckpointerShmem->walprohibit_cv);
+}
+
 /*
  * CompactCheckpointerRequestQueue
  *		Remove duplicates from the request queue to avoid backend fsyncs.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1c..437da6ac473 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4057,6 +4057,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 13648887187..b973727a580 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 7af96c77082..d6411e4f3e9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3644,15 +3644,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	/* some code */
-	elog(INFO, "AlterSystemSetWALProhibitState() called");
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef70..dfc44136867 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12041,4 +12055,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..163fe0d2fce
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,32 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * This bit is used during the transition from one state to another.  When
+ * this bit is set, the state indicated by the 0th bit has not yet been
+ * confirmed.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
+#endif		/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..183b2fa5a14 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -325,6 +326,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(uint32 new_state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* wal_prohibited determines whether WAL insertion is allowed or not. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 13872013823..780c59f3e48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -955,6 +955,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..e8271b49f6d 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -35,6 +35,8 @@ extern void CheckpointWriteDelay(int flags, double progress);
 
 extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
 
+extern void WALProhibitRequest(void);
+
 extern void AbsorbSyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
-- 
2.22.0

v5-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/x-patch)
From df3bdd8d8f5c410e02dcf1f6d60317501e71cb8a Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v5 1/5] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ece..13648887187 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0

v5-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/x-patch)
From 6d6fc8f7470ce4b91d56536a6fa14fb761556aed Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v5 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, an Assert or an ERROR is added where WAL is
written while the system is WAL prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also checks XID validity.  For that, AssertWALPermitted_HaveXID()
   is added.

To enforce the rule that the aforesaid assert or error check happens before
entering a critical section for a WAL write, a new assert-only flag,
walpermit_checked_state, is added.  If this check is missing, XLogBeginInsert()
will fail an assertion when it is in a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
---
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 +++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 +++++-
 src/backend/access/hash/hash.c            | 19 +++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++--
 src/backend/access/hash/hashpage.c        |  9 ++++
 src/backend/access/heap/heapam.c          | 26 +++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++
 src/backend/access/spgist/spgvacuum.c     | 22 ++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 27 ++++++++----
 src/backend/access/transam/xloginsert.c   | 13 +++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 ++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/storage/lmgr/lock.c           |  6 +--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 50 ++++++++++++++++++++++-
 src/include/miscadmin.h                   | 27 ++++++++++++
 40 files changed, 503 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8d08b05f515..0d9997463b4 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -333,6 +334,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -378,6 +380,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -386,10 +389,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -410,7 +416,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -548,6 +554,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -588,7 +597,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 9cd6638df62..7c4cbee627a 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8eb276e4644..eb784bbf766 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2174,6 +2177,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2692,6 +2697,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3444,6 +3451,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3617,6 +3626,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4550,6 +4561,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5341,6 +5354,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5499,6 +5514,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5607,6 +5624,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5723,6 +5742,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
@@ -5753,6 +5773,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5763,7 +5787,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3ad4222cb8a..a0afbe76c2b 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -271,6 +273,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 	}
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -304,7 +310,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 03c8e1ff7ea..d1117608160 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -764,6 +765,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1206,6 +1208,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1221,7 +1226,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1482,6 +1487,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1499,7 +1507,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1923,6 +1931,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1930,6 +1939,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1955,7 +1967,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibit state, so
+	 * skip the permission check when we get here from the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c87..2c3d8aaecbd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1898,13 +1901,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7f392480ac0..8c3fc251a29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1832,6 +1849,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1920,6 +1938,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1971,7 +1993,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2064,6 +2086,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2277,6 +2300,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2356,7 +2383,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a4..0a88740764f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2938,7 +2941,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f9981e35..ff2bc8cc74b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index a3f1a750744..6f66804735f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -16,6 +16,16 @@
 #include "postmaster/bgwriter.h"
 #include "storage/procsignal.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission has been
+ * checked before starting a critical section that will write WAL.  For this,
+ * one of CheckWALPermitted, AssertWALPermittedHaveXID, or AssertWALPermitted
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * ProcessBarrierWALProhibit()
  *
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 98a1943f717..4e8c7bb7dcc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2091eff0d53..211abaca22a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1025,7 +1025,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2860,9 +2860,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise, dirty buffers
+	 * could never be evicted, since that requires flushing WAL up to their LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8852,6 +8854,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8881,6 +8885,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9109,6 +9115,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9266,6 +9274,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9912,7 +9922,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9926,10 +9936,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9951,8 +9961,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* WAL permission has already been verified above */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c526bb19281..506d7e97f38 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited ERROR here would escalate to PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..8dacf48db24 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while the
+		 * system is in the WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e8aa9a0ec3..5404b7bc126 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* Checkpoints are allowed in recovery but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b4..186cc47be1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3638,13 +3638,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool		needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 163fe0d2fce..3442df5be2f 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -19,8 +19,8 @@ extern bool ProcessBarrierWALProhibit(void);
 extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
@@ -29,4 +29,50 @@ extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
  */
 #define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
 
+/* The calling code is never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertion above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while the system state is being changed to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif		/* WALPROHIBIT_H */
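
For reference, here is a minimal sketch of how a WAL-writing code path is
expected to use these helpers under the new coding rule.  The function below is
hypothetical and is not part of the patch; the rmgr/info arguments to
XLogInsert() are placeholders:

	static void
	illustrative_page_update(Relation rel, Buffer buf)
	{
		bool		needwal = RelationNeedsWAL(rel);

		/* May ereport(ERROR); must happen before the critical section */
		if (needwal)
			CheckWALPermitted();

		START_CRIT_SECTION();

		/* ... scribble on the page ... */
		MarkBufferDirty(buf);

		if (needwal)
		{
			XLogRecPtr	recptr;

			XLogBeginInsert();
			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
			recptr = XLogInsert(RM_GENERIC_ID, 0);	/* placeholders */
			PageSetLSN(BufferGetPage(buf), recptr);
		}

		END_CRIT_SECTION();
	}

Paths that can only run with an assigned XID (ordinary INSERT/UPDATE/DELETE)
use AssertWALPermittedHaveXID() instead, because such sessions are terminated
before the system becomes read only, so an ERROR check there would be dead
code.
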
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section; otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0
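
To make the assertion-state machinery concrete, the intended lifecycle of
walpermit_checked_state in an assert-enabled build is roughly the following
(an illustrative sequence only, not code from the patch):

	CheckWALPermitted();	/* walpermit_checked_state = WALPERMIT_CHECKED */
	START_CRIT_SECTION();
	XLogBeginInsert();		/* Assert(state != WALPERMIT_UNCHECKED || !CritSectionCount) holds */
	/* ... XLogRegisterBuffer()/XLogRegisterData(), XLogInsert() ... */
	END_CRIT_SECTION();		/* RESET_WALPERMIT_CHECKED_STATE(): back to WALPERMIT_UNCHECKED */
	CHECK_FOR_INTERRUPTS();	/* outside a critical section, also resets to WALPERMIT_UNCHECKED */

XLogResetInsertion(), called at the end of XLogInsert(), performs the same
reset, which marks the state WALPERMIT_CHECKED_AND_USED while still inside the
critical section.
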

Attachment: v5-0002-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From 32ce3f068a96fb8df185c935b11abb254697d7be Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v5 2/5] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de664..ba3393b8ccf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(WALProhibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5406,6 +5415,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e3f33c40be5..b09bff458af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(WALProhibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3458,6 +3464,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515da..37f297f39a5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1358,6 +1358,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(WALProhibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3914,6 +3923,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..0ac826d3c2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(WALProhibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2874,6 +2887,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dbb47d49829..6090d18ec61 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10172,8 +10173,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->WALProhibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9b0c376c8cb..7af96c77082 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2772,6 +2779,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3636,3 +3644,15 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index f41785f11c1..408f6260b26 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1864,9 +1864,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 151bcdb7ef5..f2c1ae8e3fe 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3194,6 +3194,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d990463ce9..29d6f6c968d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0

v5-0005-Documentation-WIP.patch (application/x-patch)
From 7b25a9283429d63a7cb34aeba3f0836f40eaa6e9 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v5 5/5] Documentation - WIP

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not possible to insert write-ahead log
+records, either because the system is still in recovery or because the system
+has been forced into the WAL prohibited state by ALTER SYSTEM READ ONLY.  We
+have a lower-level defense in XLogBeginInsert() and elsewhere to stop us from
+modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error; as mentioned previously, any error there escalates to PANIC.
+
+We never reach the point of trying to write WAL during recovery, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier must stop
+writing WAL immediately.  To absorb the barrier, the backend kills its running
+transaction if it has a valid XID, since a valid XID indicates that the
+transaction has performed and/or is planning a WAL write.  A transaction that
+has not yet acquired a valid XID, or an operation such as VACUUM or CREATE
+INDEX CONCURRENTLY that does not necessarily have a valid XID for its WAL
+writes, is not stopped during barrier processing, and might instead hit an
+error from XLogBeginInsert() when it tries to write WAL in the read only
+state.  To prevent such an error from occurring inside a critical section,
+WAL write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that will write WAL, we have added an assert-only flag that records
+whether permission was checked before calling XLogBeginInsert().  If it was
+not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-checked flag, one of
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for choosing among these
+permission check routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	a valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERT and UPDATE records are expected, which never happen
+	without a valid XID, can be checked using AssertWALPermitted_HaveXID(), so
+	that non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while read only (i.e. during
+recovery or in the WAL prohibit state), so we simply skip dirtying blocks
+because of hints in that case.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0

#58 Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#57)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is a rebased version on top of the latest master head (# 3e98c0bafb2).

Does anyone, especially anyone named Andres Freund, have comments on
0001? That work is somewhat independent of the rest of this patch set
from a theoretical point of view, and it seems like if nobody sees a
problem with the line of attack there, it would make sense to go ahead
and commit that part. Considering that this global barrier stuff is
new and that I'm not sure how well we really understand the problems
yet, there's a possibility that we might end up revising these details
again. I understand that most people, including me, are somewhat
reluctant to see experimental code get committed, but in this case that
ship has basically sailed already, since neither of the patches that
we thought would use the barrier mechanism ended up making it into v13.
I don't think it's really making things any worse to try to improve
the mechanism.

0002 isn't separately committable, but I don't see anything wrong with it.

Regarding 0003:

I don't understand why ProcessBarrierWALProhibit() can safely assert
that the WALPROHIBIT_STATE_READ_ONLY is set.

+ errhint("Cannot continue a
transaction if it has performed writes while system is read only.")));

This sentence is bad because it makes it sound like the current
transaction successfully performed a write after the system had
already become read-only. I think something like errdetail("Sessions
with open write transactions must be terminated.") would be better.

I think SetWALProhibitState() could be in walprohibit.c rather than
xlog.c. Also, this function appears to have obvious race conditions.
It fetches the current state, then thinks things over while holding no
lock, and then unconditionally updates the current state. What happens
if somebody else has changed the state in the meantime? I had sort of
imagined that we'd use something like pg_atomic_uint32 for this and
manipulate it using compare-and-swap operations. Using some kind of
lock is probably fine, too, but you have to hold it long enough that
the variable can't change under you while you're still deciding
whether it's OK to modify it, or else recheck after reacquiring the
lock that the value doesn't differ from what you expect.
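
(For illustration only, here is a minimal sketch of the compare-and-swap
pattern being described, using the pg_atomic API; the function and variable
names below are placeholders rather than anything taken from the attached
patches.)

    #include "port/atomics.h"

    static pg_atomic_uint32 *SharedWALProhibitState;    /* in shared memory */

    static bool
    TrySetWALProhibitState(uint32 new_state)
    {
        uint32      expected = pg_atomic_read_u32(SharedWALProhibitState);

        for (;;)
        {
            /* Nothing to do if we are already in the requested state. */
            if (expected == new_state)
                return false;

            /*
             * Attempt the swap.  On failure, 'expected' is updated to the
             * value some other backend installed, and we decide again based
             * on what we actually found there.
             */
            if (pg_atomic_compare_exchange_u32(SharedWALProhibitState,
                                               &expected, new_state))
                return true;
        }
    }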

I think the choice to use info_lck to synchronize
SharedWALProhibitState is very strange -- what is the justification
for that? I thought the idea might be that we frequently need to check
SharedWALProhibitState at times when we'd be holding info_lck anyway,
but it looks to me like you always do separate acquisitions of
info_lck just for this, in which case I don't see why we should use it
here instead of a separate lock. For that matter, why does this need
to be part of XLogCtlData rather than a separate shared memory area
that is private to walprohibit.c?
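
(Likewise, just a rough sketch of what a small shared memory area private to
walprohibit.c could look like, following the usual ShmemInitStruct() pattern;
the names here are illustrative only, not a claim about how the patch should
spell them.)

    #include "port/atomics.h"
    #include "storage/shmem.h"

    typedef struct WALProhibitStateData
    {
        pg_atomic_uint32 state;     /* current WAL prohibit state */
    } WALProhibitStateData;

    static WALProhibitStateData *WALProhibitState;

    /* Called from shared memory initialization at postmaster startup. */
    void
    WALProhibitShmemInit(void)
    {
        bool        found;

        WALProhibitState = (WALProhibitStateData *)
            ShmemInitStruct("WAL Prohibit State",
                            sizeof(WALProhibitStateData),
                            &found);
        if (!found)
            pg_atomic_init_u32(&WALProhibitState->state, 0);
    }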

-       else
+       /*
+        * Can't perform checkpoint or xlog rotation without writing WAL.
+        */
+       else if (XLogInsertAllowed())

Not project style.

+ case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:

Can we drop the word SYSTEM here to make this shorter, or would that
break some convention?

+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+ static const char *
+ show_system_is_read_only(void)
+{

I'm not sure the comment is appropriate here, but I'm very sure the
extra spaces before "static" and "show" are not per style.

+ /* We'll be done once in-progress flag bit is cleared */

Another whitespace mistake.

+               elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+       elog(DEBUG1, "Done WALProhibitRequest");

I think these should be removed.

Can WALProhibitRequest() and performWALProhibitStateChange() be moved
to walprohibit.c, just to bring more of the code for this feature
together in one place? Maybe we could also rename them to
RequestWALProhibitChange() and CompleteWALProhibitChange()?

-        * think it should leave the child state in place.
+        * think it should leave the child state in place.  Note that the upper
+        * transaction will be a force to ready-only irrespective of
its previous
+        * status if the server state is WAL prohibited.
         */
-       XactReadOnly = s->prevXactReadOnly;
+       XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();

Both instances of this pattern seem sketchy to me. You don't expect
that reverting the state to a previous state will instead change to a
different state that doesn't match up with what you had before. What
is the bad thing that would happen if we did not make this change?

-        * Else, must check to see if we're still in recovery.
+        * Else, must check to see if we're still in recovery

Spurious change.

+                       /* Request checkpoint */
+                       RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+                       ereport(LOG, (errmsg("system is now read write")));

This does not seem right. Perhaps the intention here was that the
system should perform a checkpoint when it switches to read-write
state after having skipped the startup checkpoint. But why would we do
this unconditionally in all cases where we just went to a read-write
state?
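
(To make the suggestion concrete, a conditional form might look roughly like
the sketch below; the pending flag is a hypothetical placeholder for "the
startup checkpoint was skipped", not something the posted patch defines.)

    /* Only force a checkpoint if the startup checkpoint was skipped. */
    if (endOfRecoveryCheckpointPending)
    {
        RequestCheckpoint(CHECKPOINT_IMMEDIATE);
        endOfRecoveryCheckpointPending = false;
    }
    ereport(LOG, (errmsg("system is now read write")));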

There's probably quite a bit more to say about 0003 but I think I'm
running too low on mental energy to say more now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#59 Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#58)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sat, Aug 29, 2020 at 1:23 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is a rebased version on top of the latest master head (# 3e98c0bafb2).

Does anyone, especially anyone named Andres Freund, have comments on
0001? That work is somewhat independent of the rest of this patch set
from a theoretical point of view, and it seems like if nobody sees a
problem with the line of attack there, it would make sense to go ahead
and commit that part. Considering that this global barrier stuff is
new and that I'm not sure how well we really understand the problems
yet, there's a possibility that we might end up revising these details
again. I understand that most people, including me, are somewhat
reluctant to see experimental code get committed, but in this case that
ship has basically sailed already, since neither of the patches that
we thought would use the barrier mechanism ended up making it into v13.
I don't think it's really making things any worse to try to improve
the mechanism.

0002 isn't separately committable, but I don't see anything wrong with it.

Regarding 0003:

I don't understand why ProcessBarrierWALProhibit() can safely assert
that the WALPROHIBIT_STATE_READ_ONLY is set.

The if-block that kills a transaction is entered only when it has a valid XID,
and that happens only when the system state is changing to READ_ONLY.

+ errhint("Cannot continue a
transaction if it has performed writes while system is read only.")));

This sentence is bad because it makes it sound like the current
transaction successfully performed a write after the system had
already become read-only. I think something like errdetail("Sessions
with open write transactions must be terminated.") would be better.

Ok, changed as suggested in the attached version.

I think SetWALProhibitState() could be in walprohibit.c rather than
xlog.c. Also, this function appears to have obvious race conditions.
It fetches the current state, then thinks things over while holding no
lock, and then unconditionally updates the current state. What happens
if somebody else has changed the state in the meantime? I had sort of
imagined that we'd use something like pg_atomic_uint32 for this and
manipulate it using compare-and-swap operations. Using some kind of
lock is probably fine, too, but you have to hold it long enough that
the variable can't change under you while you're still deciding
whether it's OK to modify it, or else recheck after reacquiring the
lock that the value doesn't differ from what you expect.

I think the choice to use info_lck to synchronize
SharedWALProhibitState is very strange -- what is the justification
for that? I thought the idea might be that we frequently need to check
SharedWALProhibitState at times when we'd be holding info_lck anyway,
but it looks to me like you always do separate acquisitions of
info_lck just for this, in which case I don't see why we should use it
here instead of a separate lock. For that matter, why does this need
to be part of XLogCtlData rather than a separate shared memory area
that is private to walprohibit.c?

In the attached patch I added a separate shared memory structure for the WAL
prohibit state. SharedWALProhibitState is now a pg_atomic_uint32 and is part of
that structure instead of XLogCtlData. The shared state is changed using a
compare-and-swap operation.

I hope that is enough to avoid the race conditions you described.

-       else
+       /*
+        * Can't perform checkpoint or xlog rotation without writing WAL.
+        */
+       else if (XLogInsertAllowed())

Not project style.

Corrected.

+ case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:

Can we drop the word SYSTEM here to make this shorter, or would that
break some convention?

No issue, removed SYSTEM.

+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+ static const char *
+ show_system_is_read_only(void)
+{

Fixed.

I'm not sure the comment is appropriate here, but I'm very sure the
extra spaces before "static" and "show" are not per style.

+ /* We'll be done once in-progress flag bit is cleared */

Another whitespace mistake.

Fixed.

+               elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+       elog(DEBUG1, "Done WALProhibitRequest");

I think these should be removed.

Removed.

Can WALProhibitRequest() and performWALProhibitStateChange() be moved
to walprohibit.c, just to bring more of the code for this feature
together in one place? Maybe we could also rename them to
RequestWALProhibitChange() and CompleteWALProhibitChange()?

Yes, I have moved these functions to walprohibit.c and renamed them as suggested.
For this, I needed to add a few helper functions to send a signal to the
checkpointer and to update the control file, namely send_signal_to_checkpointer()
and SetControlFileWALProhibitFlag() respectively, since checkpointer_pid
and ControlFile are not directly accessible from walprohibit.c.

-        * think it should leave the child state in place.
+        * think it should leave the child state in place.  Note that the upper
+        * transaction will be a force to ready-only irrespective of
its previous
+        * status if the server state is WAL prohibited.
*/
-       XactReadOnly = s->prevXactReadOnly;
+       XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();

Both instances of this pattern seem sketchy to me. You don't expect
that reverting the state to a previous state will instead change to a
different state that doesn't match up with what you had before. What
is the bad thing that would happen if we did not make this change?

We can drop these changes now, since we simply terminate sessions that have
performed, or are expected to perform, write operations.

-        * Else, must check to see if we're still in recovery.
+        * Else, must check to see if we're still in recovery

Spurious change.

Fixed.

+                       /* Request checkpoint */
+                       RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+                       ereport(LOG, (errmsg("system is now read write")));

This does not seem right. Perhaps the intention here was that the
system should perform a checkpoint when it switches to read-write
state after having skipped the startup checkpoint. But why would we do
this unconditionally in all cases where we just went to a read-write
state?

You are correct; this could be expensive if the system is read-only for only a
short period. In the initial version I did this unconditionally to avoid
additional shared-memory variables in XLogCtlData, but now that the WAL
prohibit state has its own shared-memory structure, I have added the required
variable to it. The checkpoint is now requested conditionally, with the
CHECKPOINT_END_OF_RECOVERY & CHECKPOINT_IMMEDIATE flags, as we do in the
startup process. Note that, to mark that the end-of-recovery checkpoint was
skipped by the startup process, I have added a helper function named
MarkCheckPointSkippedInWalProhibitState(); I am not sure the name I have
chosen is the best fit.
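
(In short, the intended flow is roughly the following sketch; the surrounding
startup and checkpointer code is simplified and elided here, so take the call
sites as illustrative rather than exact.)

    /* Startup process: the end-of-recovery checkpoint had to be skipped. */
    MarkCheckPointSkippedInWalProhibitState();

    /* Checkpointer, while completing ALTER SYSTEM READ WRITE: */
    if (WALProhibitState->checkpointPending)
    {
        RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
        WALProhibitState->checkpointPending = false;
    }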

There's probably quite a bit more to say about 0003 but I think I'm
running too low on mental energy to say more now.

Thanks for your time and suggestions.

Regards,
Amul

Attachments:

v6-0002-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From c711defd1e1ae7121293b933eba686d5b241ca49 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v6 2/5] Add alter system read only/write syntax

Note that the syntax doesn't have an implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de664..ba3393b8ccf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(WALProhibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5406,6 +5415,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e3f33c40be5..b09bff458af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(WALProhibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3458,6 +3464,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515da..37f297f39a5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1358,6 +1358,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(WALProhibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3914,6 +3923,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..0ac826d3c2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(WALProhibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2874,6 +2887,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dbb47d49829..6090d18ec61 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10172,8 +10173,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->WALProhibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6154d2c8c63..1730a5402f7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2806,6 +2813,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3670,3 +3678,15 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index f41785f11c1..408f6260b26 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1864,9 +1864,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 47d4c07306d..f82de3c126f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d990463ce9..29d6f6c968d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0

v6-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/x-patch)
From 3230e0b1b77446eba50313466fc50ffddfe61ac7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v6 1/5] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ece..13648887187 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0

v6-0005-Documentation-WIP.patch (application/x-patch)
From 6c508b597893a09af4d287284df281b3b1a11b65 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v6 5/5] Documentation - WIP

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not possible to insert write-ahead log
+records, either because the system is still in recovery or because the system
+has been forced into the WAL prohibited state by ALTER SYSTEM READ ONLY.  We
+have a lower-level defense in XLogBeginInsert() and elsewhere to stop us from
+modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error; as mentioned previously, any error there escalates to PANIC.
+
+We never reach the point of trying to write WAL during recovery, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier must stop
+writing WAL immediately.  To absorb the barrier, the backend kills its running
+transaction if it has a valid XID, since a valid XID indicates that the
+transaction has performed and/or is planning a WAL write.  A transaction that
+has not yet acquired a valid XID, or an operation such as VACUUM or CREATE
+INDEX CONCURRENTLY that does not necessarily have a valid XID for its WAL
+writes, is not stopped during barrier processing, and might instead hit an
+error from XLogBeginInsert() when it tries to write WAL in the read only
+state.  To prevent such an error from occurring inside a critical section,
+WAL write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that will write WAL, we have added an assert-only flag that records
+whether permission was checked before calling XLogBeginInsert().  If it was
+not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-checked flag, one of
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for choosing among these
+permission check routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	a valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERT and UPDATE records are expected, which never happen
+	without a valid XID, can be checked using AssertWALPermitted_HaveXID(), so
+	that non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while read only (i.e. during
+recovery or in the WAL prohibit state), so we simply skip dirtying blocks
+because of hints in that case.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0

v6-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch (application/x-patch)
From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState()
    raises a request to the checkpointer by marking the current state as in
    progress in shared memory.  The checkpointer, noticing that the current
    state has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, emits the
    barrier and then acknowledges back to the backend that requested the state
    change once the transition has been completed.  The final state is recorded
    in the control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then the
    backend will be killed by throwing FATAL (XXX: needs more discussion
    on this).

 3. Otherwise, if the backend is running a transaction which has not yet been
    assigned an XID, we don't need to do anything special; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (from existing or new backend) starts as a read-only
    transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited state until someone wakes them up, e.g. a backend
    might later request that the system be put back into read-write mode.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint
    and xlog rotation. Starting up again will perform crash recovery (XXX:
    needs some discussion on this as well), but the end-of-recovery checkpoint
    will be skipped and performed later, when the system is changed to
    WAL-Permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.

 8. Only super user can toggle WAL-Prohibit state.

 9. Add a system_is_read_only GUC to show the system state -- true when the
    system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 321 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +--
 src/backend/access/transam/xlog.c        |  84 +++++-
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  37 +++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  26 +-
 src/backend/tcop/utility.c               |  14 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  83 ++++++
 src/include/access/xlog.h                |   2 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 20 files changed, 594 insertions(+), 68 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..8ab30be1d51
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,321 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/* Indicates current WAL prohibit state */
+	pg_atomic_uint32 SharedWALProhibitState;
+
+	/* Startup checkpoint pending */
+	bool		checkpointPending;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/* Should be here only for the WAL prohibit state. */
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some existing challenges with the wire protocol, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	uint32		state;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we have yet to convey this WAL prohibit state to all backends,
+	 * mark it in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return;					/* server is already in the desired state */
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange();
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Ask the checkpointer to complete the requested WAL prohibit state change.
+ */
+static void
+RequestWALProhibitChange(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(GetWALProhibitState());
+		return;
+	}
+
+	send_signal_to_checkpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		bool		wal_prohibited;
+
+		wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		/* Update the control file to make state persistent */
+		SetControlFileWALProhibitFlag(wal_prohibited);
+
+		if (wal_prohibited)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/*
+			 * Request checkpoint if the end-of-recovery checkpoint has been
+			 * skipped previously.
+			 */
+			if (WALProhibitState->checkpointPending)
+			{
+				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+								  CHECKPOINT_IMMEDIATE);
+				WALProhibitState->checkpointPending = false;
+			}
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * GetWALProhibitState()
+ *
+ * Atomically return the current server WAL prohibited state
+ */
+uint32
+GetWALProhibitState(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Change current WAL prohibit state to the input state.
+ *
+ * If the server has already moved to the requested WAL prohibit state, or if
+ * the desired state is the same as the current state, return false to
+ * indicate that the server state did not change.  Otherwise return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+	bool		state_updated = false;
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	/* Server is already in requested state */
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Prevent a concurrent, contrary state transition while one is in progress */
+	if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+		(cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read only is already in progress"),
+					 errhint("Try again later.")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read write is already in progress"),
+					 errhint("Try again later.")));
+	}
+
+	/* Update the new state in shared memory */
+	state_updated =
+		pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
+									   &cur_state, new_state);
+
+	if (!state_updated)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("system read/write state changed concurrently"),
+				 errhint("Try again later.")));
+
+	return true;
+}
+
+/*
+ * MarkCheckPointSkippedInWalProhibitState()
+ *
+ * Sets the checkpoint-pending flag so that the skipped checkpoint can be
+ * performed later, when the system is put back into the WAL-permitted state.
+ */
+void
+MarkCheckPointSkippedInWalProhibitState(void)
+{
+	WALProhibitState->checkpointPending = true;
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+
+	pg_atomic_init_u32(&WALProhibitState->SharedWALProhibitState, 0);
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb13..188c299bed9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae4..e170df78a87 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -245,9 +246,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -967,6 +969,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -7703,6 +7706,14 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, update the WAL prohibit state in shared
+	 * memory; it determines whether further WAL inserts are allowed.
+	 */
+	(void) SetWALProhibitState(ControlFile->wal_prohibited ?
+							   WALPROHIBIT_STATE_READ_ONLY :
+							   WALPROHIBIT_STATE_READ_WRITE);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7713,7 +7724,16 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		MarkCheckPointSkippedInWalProhibitState();
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7959,6 +7979,25 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8174,9 +8213,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8195,9 +8234,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8219,6 +8269,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A shutdown checkpoint or xlog rotation is performed only if WAL writing
+	 * is permitted; in recovery, a restartpoint is created instead.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8523,6 +8583,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1b8cd7bacd4..aa4cdd57ec1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read-only, just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427f..6c6ff7dc3af 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -268,7 +268,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b80..b841721c9ec 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -332,6 +333,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -342,6 +344,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later ask us to put the system back into the
+			 * read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * send_signal_to_checkpointer allows a process to signal the checkpointer process.
+ */
+void
+send_signal_to_checkpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b236143..03b63a0d05f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4060,6 +4060,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd6..2d000ec2ff7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 13648887187..b973727a580 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1730a5402f7..513a7f324fb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3678,15 +3678,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	/* some code */
-	elog(INFO, "AlterSystemSetWALProhibitState() called");
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef70..774628b3cef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12041,4 +12055,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() prints for a
+ * boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..14e1f5b3a2e
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,83 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void CompleteWALProhibitChange(uint32 wal_state);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(uint32 new_state);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateShmemInit(void);
+
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * This bit is used during the transition from one state to another.  When it
+ * is set, the state indicated by the 0th bit has not yet been confirmed by
+ * all backends.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertion above, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..2c423d6b609 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -325,6 +326,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes (inserts) are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf6..76f504ee277 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -956,6 +956,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..d72aa4c9fa0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void send_signal_to_checkpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29d6f6c968d..4411111a78e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2665,6 +2665,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.22.0
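
To make the intended usage of the preceding patch concrete, here is a minimal
sketch of a superuser session (output and prompts elided; system_is_read_only
is the internal GUC added in guc.c above):

    -- Prohibit new WAL writes; sessions with an open write transaction (an
    -- assigned XID) are terminated, everything else keeps running read-only.
    ALTER SYSTEM READ ONLY;

    -- The internal GUC reflects the current state ("on" here).
    SHOW system_is_read_only;

    -- Return to the WAL-permitted state; a previously skipped end-of-recovery
    -- checkpoint, if any, is requested at this point.
    ALTER SYSTEM READ WRITE;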

Attachment: v6-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/x-patch)
From 9242327f51d45f2cdbe1b9ece99ed98366d1cbe0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v6 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before each critical section that writes WAL, to be
raised when the system is WAL prohibited, based on the following criteria:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, the common
   static inline function CheckWALPermitted() is added.
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that, AssertWALPermittedHaveXID()
   is added.

To enforce the rule that one of these checks precedes a critical section that
writes WAL, a new assert-only flag walpermit_checked_state is added.  If the
check is missing, XLogBeginInsert() will fail an assertion when it is called
inside a critical section.

If the WAL insert is not done inside a critical section, the above check is
unnecessary; we can rely on XLogBeginInsert() to perform the check and report
an error.
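
For illustration only, a WAL-writing code path under this rule is expected to
look roughly like the sketch below (the surrounding context, a relation rel
being modified, is hypothetical; only the placement of the check relative to
START_CRIT_SECTION() and XLogBeginInsert() matters):

    /* hypothetical caller; rel is some relation being modified */
    bool        needwal = RelationNeedsWAL(rel);

    if (needwal)
        CheckWALPermitted();    /* ERROR here if the system is read only */

    START_CRIT_SECTION();

    /* ... modify shared buffers, MarkBufferDirty() ... */

    if (needwal)
    {
        XLogBeginInsert();      /* would assert if the check above were skipped */
        /* ... XLogRegisterBuffer()/XLogRegisterData() and XLogInsert() ... */
    }

    END_CRIT_SECTION();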
---
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 +++++--
 src/backend/access/heap/vacuumlazy.c      | 18 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 +++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++++++----
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 27 ++++++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++--
 src/backend/commands/sequence.c           | 16 ++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  4 +--
 src/include/miscadmin.h                   | 27 ++++++++++++++++
 40 files changed, 457 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* The index-building transaction will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8d08b05f515..0d9997463b4 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -333,6 +334,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -378,6 +380,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -386,10 +389,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -410,7 +416,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -548,6 +554,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -588,7 +597,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* The index-building transaction will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e53..d91ca2b391c 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* The index-building transaction will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9b5f417eac4..a411dbb128c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2174,6 +2177,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2692,6 +2697,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3444,6 +3451,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3617,6 +3626,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4550,6 +4561,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5341,6 +5354,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5499,6 +5514,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5607,6 +5624,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5723,6 +5742,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
@@ -5753,6 +5773,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5763,7 +5787,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index bc510e2e9b3..9dcae7d2153 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 53b1a952543..4eff8865ca7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1482,6 +1487,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1499,7 +1507,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1932,6 +1940,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1939,6 +1948,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1964,7 +1976,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state;
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c87..2c3d8aaecbd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1898,13 +1901,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7f392480ac0..8c3fc251a29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1832,6 +1849,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1920,6 +1938,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1971,7 +1993,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2064,6 +2086,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2277,6 +2300,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2356,7 +2383,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a4..0a88740764f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2938,7 +2941,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f9981e35..ff2bc8cc74b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 8ab30be1d51..6b86e5cffcc 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -20,6 +20,16 @@
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted, AssertWALPermittedHaveXID, or AssertWALPermitted must
+ * be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 188c299bed9..abda095e735 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We reach here only with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e170df78a87..1273cb1ed52 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1019,7 +1019,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2854,9 +2854,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state should not restrict WAL flushing, since a dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8788,6 +8790,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8817,6 +8821,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9045,6 +9051,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9202,6 +9210,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9848,7 +9858,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9862,10 +9872,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9887,8 +9897,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c526bb19281..506d7e97f38 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..8dacf48db24 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while the
+		 * system is in the WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b841721c9ec..81337d48d82 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -930,6 +930,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b4..186cc47be1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3638,13 +3638,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 14e1f5b3a2e..a876ec8a064 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -24,8 +24,8 @@ extern void MarkCheckPointSkippedInWalProhibitState(void);
 extern void WALProhibitStateShmemInit(void);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0

#60Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#58)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On 2020-08-28 15:53:29 -0400, Robert Haas wrote:

On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is a rebased on top of the latest master head (# 3e98c0bafb2).

Does anyone, especially anyone named Andres Freund, have comments on
0001? That work is somewhat independent of the rest of this patch set
from a theoretical point of view, and it seems like if nobody sees a
problem with the line of attack there, it would make sense to go ahead
and commit that part.

It'd be easier to review the proposed commit if it included reasoning
about the change...

In particular, it looks to me like the commit actually implements two
different changes:
1) Allow a barrier function to "reject" a set barrier, because it can't
be set in that moment
2) Allow barrier functions to raise errors

and there's not much of an explanation as to why (probably somewhere
upthread, but ...)

/*
* ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);

 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);

This pattern seems like it'll get unwieldy with more than one barrier
type. And won't flag "unhandled" barrier types either (already the case,
I know). We could go for something like:

while (flags != 0)
{
	barrier_bit = pg_rightmost_one_pos32(flags);
	barrier_type = 1 << barrier_bit;

	switch (barrier_type)
	{
		case PROCSIGNAL_BARRIER_PLACEHOLDER:
			processed = ProcessBarrierPlaceholder();
			break;
	}

	if (processed)
		BARRIER_CLEAR_BIT(flags, barrier_type);
}

But perhaps that's too complicated?

+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);

For this to be correct, wouldn't flags need to be volatile? Otherwise
this might use a register value for flags, which might not contain the
correct value at this point.
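
Roughly what I mean - untested sketch, all names taken from the hunk
above: because PG_TRY()/PG_CATCH() are built on sigsetjmp(), a local that
is modified inside PG_TRY() and then read in PG_CATCH() has to be
volatile-qualified to be guaranteed to hold its latest value after the
longjmp.

	volatile uint32 flags;

	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);

	PG_TRY();
	{
		/* ... bits are cleared from flags as barriers get absorbed ... */
	}
	PG_CATCH();
	{
		/* without volatile, this could read a stale register copy of flags */
		pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
							   flags);
		ProcSignalBarrierPending = true;
		InterruptPending = true;

		PG_RE_THROW();
	}
	PG_END_TRY();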

Perhaps a comment explaining why we have to clear bits first would be
good?

+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}

I wish there were a way we could combine the PG_CATCH and this instance
of the same code. I'd probably just move it into a helper.
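
Something like this, perhaps (sketch; the helper name is invented):

static void
ResetProcSignalBarrierBits(uint32 flags)
{
	pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask, flags);
	ProcSignalBarrierPending = true;
	InterruptPending = true;
}

Both the PG_CATCH() block and the post-PG_END_TRY() path could then just
call it with whatever flags are left over.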

It might be good to add a warning to WaitForProcSignalBarrier() or next to
pss_barrierCheckMask, indicating that it's *not* OK to look at
pss_barrierCheckMask when checking whether barriers have been processed.

Considering that this global barrier stuff is
new and that I'm not sure how well we really understand the problems
yet, there's a possibility that we might end up revising these details
again. I understand that most people, including me, are somewhat
reluctant to see experimental code get committed, but in this case that
ship has basically sailed already, since neither of the patches that
we thought would use the barrier mechanism ended up making it into v13.
I don't think it's really making things any worse to try to improve
the mechanism.

Yea, I have no problem with this.

Greetings,

Andres Freund

#61Andres Freund
andres@anarazel.de
In reply to: Amul Sul (#59)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...

On 2020-09-01 16:43:10 +0530, Amul Sul wrote:

diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..0ac826d3c2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
READ_DONE();
}
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(WALProhibited);
+
+	READ_DONE();
+}
+

Why do we need readfuncs support for this?

+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* some code */
+	elog(INFO, "AlterSystemSetWALProhibitState() called");
+}

As long as it's not implemented it seems better to return an ERROR.

@@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
VariableSetStmt *setstmt; /* SET subcommand */
} AlterSystemStmt;

+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		WALProhibited;
+} AlterSystemWALProhibitState;
+

All the nearby fields use under_score_style names.

From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

1. When a user tried to change server state to WAL-Prohibited using
ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
raises request to checkpointer by marking current state to inprogress in
shared memory. Checkpointer, noticing that the current state is has

"is has"

WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
then acknowledges back to the backend who requested the state change once
the transition has been completed. Final state will be updated in control
file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?

2. When a backend receives the WAL-Prohibited barrier, at that moment if
it is already in a transaction and the transaction already assigned XID,
then the backend will be killed by throwing FATAL(XXX: need more discussion
on this)

3. Otherwise, if that backend running transaction which yet to get XID
assigned we don't need to do anything special

Somewhat garbled sentence...

4. A new transaction (from existing or new backend) starts as a read-only
transaction.

Maybe "(in an existing or in a new backend)"?

5. Autovacuum launcher as well as checkpointer will don't do anything in
WAL-Prohibited server state until someone wakes us up. E.g. a backend
might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"

6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
and xlog rotation. Starting up again will perform crash recovery(XXX:
need some discussion on this as well) but the end of recovery checkpoint
will be skipped and it will be performed when the system changed to
WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.

8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;

9. Add system_is_read_only GUC show the system state -- will true when system
is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay. ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.

--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,321 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/* Indicates current WAL prohibit state */
+	pg_atomic_uint32 SharedWALProhibitState;
+
+	/* Startup checkpoint pending */
+	bool		checkpointPending;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;

You're using three different naming styles for as many members.

+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))

Hm. I wonder if this check is good enough. If you look at
RecordTransactionCommit() we also WAL log in some cases where no xid was
assigned. This is particularly true of (auto-)vacuum, but also for HOT
pruning.

I think it'd be good to put the logic of this check into xlog.c and
mirror the logic in RecordTransactionCommit(). And add cross-referencing
comments to RecordTransactionCommit and the new function, reminding our
future selves that both places need to be modified.
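
Something along these lines, maybe (rough sketch; the function name is
invented, and the no-XID conditions are only hinted at - they'd have to
mirror whatever RecordTransactionCommit() actually checks):

/* hypothetical helper in xlog.c */
bool
XactWillEmitWAL(void)
{
	/* an assigned XID always requires a commit or abort record */
	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
		return true;

	/*
	 * XXX: also cover the cases where WAL is written without an XID, e.g.
	 * by (auto-)vacuum or HOT pruning, keeping this in sync with
	 * RecordTransactionCommit().
	 */
	return false;
}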

+	{
+		/* Should be here only for the WAL prohibit state. */
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);

There are no races where an ASRO READ ONLY is quickly followed by ASRO
READ WRITE where this could be reached?

+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	uint32		state;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to execute ALTER SYSTEM command")));

See comments about this above.

+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Requested state */
+	state = stmt->WALProhibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	/*
+	 * Since we yet to convey this WAL prohibit state to all backend mark it
+	 * in-progress.
+	 */
+	state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	if (!SetWALProhibitState(state))
+		return;					/* server is already in the desired state */
+

This use of bitmasks seems unnecessary to me. I'd rather have one param
for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
for WALPROHIBIT_TRANSITION_IN_PROGRESS
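
I.e. something like (signature sketch only, parameter names invented):

bool SetWALProhibitState(bool wal_prohibited, bool transition_in_progress);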

+/*
+ * RequestWALProhibitChange()
+ *
+ * Request checkpointer to make the WALProhibitState to read-only.
+ */
+static void
+RequestWALProhibitChange(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(GetWALProhibitState());
+		return;
+	}
+
+	send_signal_to_checkpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();

What if somebody concurrently changes the state back to READ WRITE?
Won't we unnecessarily wait here?

That's probably fine, because we would just wait until that transition
is complete too. But at least a comment about that would be
good. Alternatively an "ASRO transitions completed" counter or such might
be a better idea?
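
E.g. something like this (sketch; the counter member is invented):

/* requester side */
uint64		started = pg_atomic_read_u64(&WALProhibitState->transitionCount);

ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
while (pg_atomic_read_u64(&WALProhibitState->transitionCount) == started)
	ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
						   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
ConditionVariableCancelSleep();

/* checkpointer side, once a transition (in either direction) completes */
pg_atomic_fetch_add_u64(&WALProhibitState->transitionCount, 1);
ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);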

+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all writes. */
+	XLogFlush(GetXLogWriteRecPtr());

Hm, maybe I'm missing something, but why is the write pointer the right
thing to flush? That won't include records that haven't been written to
disk yet... We also need to trigger writing out all WAL that is as yet
unwritten, no? Without having thought a lot about it, it seems that
GetXLogInsertRecPtr() would be the right thing to flush?
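
I.e., unless I'm missing something, more like:

	/* flush everything that has been inserted, not just what's been written */
	XLogFlush(GetXLogInsertRecPtr());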

+	/* Set final state by clearing in-progress flag bit */
+	if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+	{
+		bool		wal_prohibited;
+
+		wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+		/* Update the control file to make state persistent */
+		SetControlFileWALProhibitFlag(wal_prohibited);

Hm. Is there an issue with not WAL logging the control file change? Is
there a scenario where a crash + recovery would end up overwriting
this?

+		if (wal_prohibited)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/*
+			 * Request checkpoint if the end-of-recovery checkpoint has been
+			 * skipped previously.
+			 */
+			if (WALProhibitState->checkpointPending)
+			{
+				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+								  CHECKPOINT_IMMEDIATE);
+				WALProhibitState->checkpointPending = false;
+			}
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up the backend who requested the state change */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);

Could be multiple backends, right?

+}
+
+/*
+ * GetWALProhibitState()
+ *
+ * Atomically return the current server WAL prohibited state
+ */
+uint32
+GetWALProhibitState(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
+}

Is there an issue with needing memory barriers here?

+/*
+ * SetWALProhibitState()
+ *
+ * Change current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+	bool		state_updated = false;
+	uint32		cur_state;
+
+	cur_state = GetWALProhibitState();
+
+	/* Server is already in requested state */
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+		return false;
+
+	/* Prevent concurrent contrary in progress transition state setting */
+	if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+		(cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read only is already in progress"),
+					 errhint("Try after sometime again.")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read write is already in progress"),
+					 errhint("Try after sometime again.")));
+	}
+
+	/* Update new state in share memory */
+	state_updated =
+		pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
+									   &cur_state, new_state);
+
+	if (!state_updated)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("system read write state concurrently changed"),
+				 errhint("Try after sometime again.")));
+

I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
of a loop. I think there are platforms (basically all load-linked /
store-conditional architectures) where that can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.
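
I.e. the usual pattern is a retry loop, roughly (just a sketch of the shape,
not of the exact error handling):

    uint32      cur_state = GetWALProhibitState();

    for (;;)
    {
        /* validate the transition away from cur_state here, erroring out if needed */

        if (pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
                                           &cur_state, new_state))
            break;
        /* on failure, cur_state has been updated to the current value; retry */
    }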

+/*
+ * MarkCheckPointSkippedInWalProhibitState()
+ *
+ * Sets checkpoint pending flag so that it can be performed next time while
+ * changing system state to WAL permitted.
+ */
+void
+MarkCheckPointSkippedInWalProhibitState(void)
+{
+	WALProhibitState->checkpointPending = true;
+}

I don't *at all* like this living outside of xlog.c. I think this should
be moved there, and merged with deferring checkpoints in other cases
(promotions, not immediately performing a checkpoint after recovery).
There's state in ControlFile *and* here for essentially the same thing.

+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();

It's somewhat ugly that we call RecoveryInProgress() once in
XLogInsertAllowed() and then again directly here... It's probably fine
runtime cost wise, but...

/*
* Subroutine to try to fetch and validate a prior checkpoint record.
*
@@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
*/
WalSndWaitStopping();

+	/*
+	 * The restartpoint, checkpoint, or xlog rotation will be performed if the
+	 * WAL writing is permitted.
+	 */
if (RecoveryInProgress())
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())

Not sure I like going via XLogInsertAllowed(), that seems like a
confusing indirection here. And it encompasses things we actually don't
want to check for - it's fragile to also look at LocalXLogInsertAllowed
here imo.

ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1b8cd7bacd4..aa4cdd57ec1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])

HandleAutoVacLauncherInterrupts();

+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+

I think we really should have a different function for places like
this. We don't want to generally hide bugs like e.g. starting the
autovac launcher in recovery, but this would.

@@ -342,6 +344,28 @@ CheckpointerMain(void)
AbsorbSyncRequests();
HandleCheckpointerInterrupts();

+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example a
+			 * backend might later on request us to put the system back to
+			 * read-write wal prohibit sate.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
/*
* Detect a pending checkpoint request by checking whether the flags
* word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)

return FirstCall;
}

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over? That seems very much not
practical. What am I missing?

+/*
+ * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
+ */
+void
+send_signal_to_checkpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}

Sudden switch to a different naming style...

Greetings,

Andres Freund

#62Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#61)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Thanks for your time.

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...

On 2020-09-01 16:43:10 +0530, Amul Sul wrote:

diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..0ac826d3c2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
READ_DONE();
}
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+     READ_LOCALS(AlterSystemWALProhibitState);
+
+     READ_BOOL_FIELD(WALProhibited);
+
+     READ_DONE();
+}
+

Why do we need readfuncs support for this?

I thought we needed that, based on your previous comment[1].

+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+     /* some code */
+     elog(INFO, "AlterSystemSetWALProhibitState() called");
+}

As long as it's not implemented it seems better to return an ERROR.

Ok, will add an error in the next version.

@@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
VariableSetStmt *setstmt; /* SET subcommand */
} AlterSystemStmt;

+/* ----------------------
+ *           Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+     NodeTag         type;
+     bool            WALProhibited;
+} AlterSystemWALProhibitState;
+

All the nearby fields use under_score_style names.

I am not sure which nearby fields with underscores you are referring to.
Probably "WALProhibited" needs to be renamed to "walprohibited" to be in
line with the nearby fields.

From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

1. When a user tried to change server state to WAL-Prohibited using
ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
raises request to checkpointer by marking current state to inprogress in
shared memory. Checkpointer, noticing that the current state is has

"is has"

WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
then acknowledges back to the backend who requested the state change once
the transition has been completed. Final state will be updated in control
file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?

Once we've initiated the change to a read-only state, we probably want to
always either finish that change or go back to read-write, even if the process
that initiated the change is interrupted. Leaving the system in a
half-way-in-between state long term seems bad. We could have added a new
background process for this, but we chose to put the checkpointer in charge of
making the state change, to avoid a new background process and keep the first
version of the patch simple. The checkpointer isn't likely to get killed, but
if it does, it will be relaunched and the new one can clean things up.
Probably later we might want a dedicated background worker that isn't likely
to get killed.

2. When a backend receives the WAL-Prohibited barrier, at that moment if
it is already in a transaction and the transaction already assigned XID,
then the backend will be killed by throwing FATAL(XXX: need more discussion
on this)

3. Otherwise, if that backend running transaction which yet to get XID
assigned we don't need to do anything special

Somewhat garbled sentence...

4. A new transaction (from existing or new backend) starts as a read-only
transaction.

Maybe "(in an existing or in a new backend)"?

5. Autovacuum launcher as well as checkpointer will don't do anything in
WAL-Prohibited server state until someone wakes us up. E.g. a backend
might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"

Ok, I'll fix all of this. I usually don't focus much on the commit message
text, but I'll try to make it as sane as possible.

6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
and xlog rotation. Starting up again will perform crash recovery(XXX:
need some discussion on this as well) but the end of recovery checkpoint
will be skipped and it will be performed when the system changed to
WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.

I would be so thankful for the help.

8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;

Ok, currently we don't have an SQL-callable function to change the system
read-write state. Do you want me to add one? If so, any naming suggestion? How
about pg_make_system_read_only(bool), or two functions,
pg_make_system_read_only(void) & pg_make_system_read_write(void)?

9. Add system_is_read_only GUC show the system state -- will true when system
is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay. ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.

Well, whether the system is in recovery or in the WAL prohibited state, it is
read-only from the user's perspective, isn't it?

--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,321 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ *           PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+     /* Indicates current WAL prohibit state */
+     pg_atomic_uint32 SharedWALProhibitState;
+
+     /* Startup checkpoint pending */
+     bool            checkpointPending;
+
+     /* Signaled when requested WAL prohibit state changes */
+     ConditionVariable walprohibit_cv;

You're using three different naming styles for as many members.

I'll fix it in the next version.

+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+     /*
+      * Kill off any transactions that have an XID *before* allowing the system
+      * to go WAL prohibit state.
+      */
+     if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))

Hm. I wonder if this check is good enough. If you look at
RecordTransactionCommit() we also WAL log in some cases where no xid was
assigned. This is particularly true of (auto-)vacuum, but also for HOT
pruning.

I think it'd be good to put the logic of this check into xlog.c and
mirror the logic in RecordTransactionCommit(). And add cross-referencing
comments to RecordTransactionCommit and the new function, reminding our
future selves that both places need to be modified.

I am not sure I have understood this; here is a snippet of the implementation
details from the first post[2]:

"Open transactions that don't have an XID are not killed, but will get an ERROR
if they try to acquire an XID later, or if they try to write WAL without
acquiring an XID (e.g. VACUUM). To make that happen, the patch adds a new
coding rule: a critical section that will write WAL must be preceded by a call
to CheckWALPermitted(), AssertWALPermitted(), or AssertWALPermitted_HaveXID().
The latter variants are used when we know for certain that inserting WAL here
must be OK, either because we have an XID (we would have been killed by a change
to read-only if one had occurred) or for some other reason."

Do let me know if you want further clarification.
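
For example, the pattern at a WAL-writing call site is roughly the following
(a simplified sketch, not an exact excerpt from the patch):

    /* ereports if WAL writes are currently prohibited */
    CheckWALPermitted();

    START_CRIT_SECTION();

    /* ... make the page modifications ... */

    XLogBeginInsert();
    /* ... register data and emit the WAL record ... */

    END_CRIT_SECTION();

i.e. the permission check happens before we enter the critical section, so we
can still error out cleanly.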

+     {
+             /* Should be here only for the WAL prohibit state. */
+             Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);

There are no races where an ASRO READ ONLY is quickly followed by ASRO
READ WRITE where this could be reached?

No, right now SetWALProhibitState() doesn't allow two transient wal prohibit
states at a time.

+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+     uint32          state;
+
+     if (!superuser())
+             ereport(ERROR,
+                             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                              errmsg("must be superuser to execute ALTER SYSTEM command")));

See comments about this above.

+     /* Alter WAL prohibit state not allowed during recovery */
+     PreventCommandDuringRecovery("ALTER SYSTEM");
+
+     /* Requested state */
+     state = stmt->WALProhibited ?
+             WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+     /*
+      * Since we yet to convey this WAL prohibit state to all backend mark it
+      * in-progress.
+      */
+     state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+     if (!SetWALProhibitState(state))
+             return;                                 /* server is already in the desired state */
+

This use of bitmasks seems unnecessary to me. I'd rather have one param
for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
for WALPROHIBIT_TRANSITION_IN_PROGRESS

Ok.

How about this signature for the new version of the SetWALProhibitState()
function: SetWALProhibitState(bool wal_prohibited, bool is_final_state)?
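
Call sites would then look roughly like this (sketch only):

    /* in AlterSystemSetWALProhibitState(): request a transition */
    if (!SetWALProhibitState(stmt->WALProhibited, false))
        return;                 /* already in the desired state */

    /* in CompleteWALProhibitChange(): mark the transition as finished */
    SetWALProhibitState(wal_prohibited, true);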

+/*
+ * RequestWALProhibitChange()
+ *
+ * Request checkpointer to make the WALProhibitState to read-only.
+ */
+static void
+RequestWALProhibitChange(void)
+{
+     /* Must not be called from checkpointer */
+     Assert(!AmCheckpointerProcess());
+     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+     /*
+      * If in a standalone backend, just do it ourselves.
+      */
+     if (!IsPostmasterEnvironment)
+     {
+             CompleteWALProhibitChange(GetWALProhibitState());
+             return;
+     }
+
+     send_signal_to_checkpointer(SIGINT);
+
+     /* Wait for the state to change to read-only */
+     ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+     for (;;)
+     {
+             /* We'll be done once in-progress flag bit is cleared */
+             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+                     break;
+
+             ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+                                                        WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+     }
+     ConditionVariableCancelSleep();

What if somebody concurrently changes the state back to READ WRITE?
Won't we unnecessarily wait here?

Yes, there will be a wait.

That's probably fine, because we would just wait until that transition
is complete too. But at least a comment about that would be
good. Alternatively a "ASRO transitions completed counter" or such might
be a better idea?

Ok, will add comments, but could you please elaborate a little bit on the
"ASRO transitions completed counter", and is there any existing counter I can
refer to?

+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 wal_state)
+{
+     uint64          barrierGeneration;
+
+     /*
+      * Must be called from checkpointer. Otherwise, it must be single-user
+      * backend.
+      */
+     Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+     Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+     /*
+      * WAL prohibit state change is initiated. We need to complete the state
+      * transition by setting requested WAL prohibit state in all backends.
+      */
+     elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+     /* Emit global barrier */
+     barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+     WaitForProcSignalBarrier(barrierGeneration);
+
+     /* And flush all writes. */
+     XLogFlush(GetXLogWriteRecPtr());

Hm, maybe I'm missing something, but why is the write pointer the right
thing to flush? That won't include records that haven't been written to
disk yet... We also need to trigger writing out all WAL that is as of
yet unwritten, no? Without having thought a lot about it, it seems that
GetXLogInsertRecPtr() would be the right thing to flush?

TBH, I am not an expert in this area. I want to flush up to the latest record
that needs to be flushed; I think GetXLogInsertRecPtr() would be fine if that
is the latest one. Note that WAL flushes are not blocked in read-only mode.

+     /* Set final state by clearing in-progress flag bit */
+     if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+     {
+             bool            wal_prohibited;
+
+             wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+             /* Update the control file to make state persistent */
+             SetControlFileWALProhibitFlag(wal_prohibited);

Hm. Is there an issue with not WAL logging the control file change? Is
there a scenario where we a crash + recovery would end up overwriting
this?

I am not sure. If the system crashes before this update, that means we haven't
acknowledged the system state change, and the server will be restarted with
the previous state.

Could you please explain what is bothering you?

+             if (wal_prohibited)
+                     ereport(LOG, (errmsg("system is now read only")));
+             else
+             {
+                     /*
+                      * Request checkpoint if the end-of-recovery checkpoint has been
+                      * skipped previously.
+                      */
+                     if (WALProhibitState->checkpointPending)
+                     {
+                             RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+                                                               CHECKPOINT_IMMEDIATE);
+                             WALProhibitState->checkpointPending = false;
+                     }
+                     ereport(LOG, (errmsg("system is now read write")));
+             }
+     }
+
+     /* Wake up the backend who requested the state change */
+     ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);

Could be multiple backends, right?

Yes, you are correct, will fix that.

+}
+
+/*
+ * GetWALProhibitState()
+ *
+ * Atomically return the current server WAL prohibited state
+ */
+uint32
+GetWALProhibitState(void)
+{
+     return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
+}

Is there an issue with needing memory barriers here?

+/*
+ * SetWALProhibitState()
+ *
+ * Change current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+     bool            state_updated = false;
+     uint32          cur_state;
+
+     cur_state = GetWALProhibitState();
+
+     /* Server is already in requested state */
+     if (new_state == cur_state ||
+             new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+             return false;
+
+     /* Prevent concurrent contrary in progress transition state setting */
+     if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+             (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+     {
+             if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("system state transition to read only is already in progress"),
+                                      errhint("Try after sometime again.")));
+             else
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("system state transition to read write is already in progress"),
+                                      errhint("Try after sometime again.")));
+     }
+
+     /* Update new state in share memory */
+     state_updated =
+             pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
+                                                                        &cur_state, new_state);
+
+     if (!state_updated)
+             ereport(ERROR,
+                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                              errmsg("system read write state concurrently changed"),
+                              errhint("Try after sometime again.")));
+

I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
of a loop. I think there's platforms (basically all load-linked /
store-conditional architectures) where than can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.

How about having some kind of lock instead, as Robert has suggested
previously[3]?

+/*
+ * MarkCheckPointSkippedInWalProhibitState()
+ *
+ * Sets checkpoint pending flag so that it can be performed next time while
+ * changing system state to WAL permitted.
+ */
+void
+MarkCheckPointSkippedInWalProhibitState(void)
+{
+     WALProhibitState->checkpointPending = true;
+}

I don't *at all* like this living outside of xlog.c. I think this should
be moved there, and merged with deferring checkpoints in other cases
(promotions, not immediately performing a checkpoint after recovery).

Here we want to perform the checkpoint quite a bit later, when the
system state changes back to read-write. For that, I think we need some flag;
if we want this in xlog.c, then we can have that flag in XLogCtl.

There's state in ControlFile *and* here for essentially the same thing.

I am sorry to trouble you so much, but I haven't understood this either.

+      * If it is not currently possible to insert write-ahead log records,
+      * either because we are still in recovery or because ALTER SYSTEM READ
+      * ONLY has been executed, force this to be a read-only transaction.
+      * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+      * us from modifying data during recovery when !XLogInsertAllowed(), but
+      * this gives the normal indication to the user that the transaction is
+      * read-only.
+      *
+      * On the other hand, we only need to set the startedInRecovery flag when
+      * the transaction started during recovery, and not when WAL is otherwise
+      * prohibited. This information is used by RelationGetIndexScan() to
+      * decide whether to permit (1) relying on existing killed-tuple markings
+      * and (2) further killing of index tuples. Even when WAL is prohibited
+      * on the master, it's still the master, so the former is OK; and since
+      * killing index tuples doesn't generate WAL, the latter is also OK.
+      * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+      */
+     XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+     s->startedInRecovery = RecoveryInProgress();

It's somewhat ugly that we call RecoveryInProgress() once in
XLogInsertAllowed() and then again directly here... It's probably fine
runtime cost wise, but...

/*
* Subroutine to try to fetch and validate a prior checkpoint record.
*
@@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
*/
WalSndWaitStopping();

+     /*
+      * The restartpoint, checkpoint, or xlog rotation will be performed if the
+      * WAL writing is permitted.
+      */
if (RecoveryInProgress())
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-     else
+     else if (XLogInsertAllowed())

Not sure I like going via XLogInsertAllowed(), that seems like a
confusing indirection here. And it encompasses things we atually don't
want to check for - it's fragile to also look at LocalXLogInsertAllowed
here imo.

ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1b8cd7bacd4..aa4cdd57ec1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])

HandleAutoVacLauncherInterrupts();

+             /* If the server is read only just go back to sleep. */
+             if (!XLogInsertAllowed())
+                     continue;
+

I think we really should have a different functions for places like
this. We don't want to generally hide bugs like e.g. starting the
autovac launcher in recovery, but this would.

So we need a separate function, like XLogInsertAllowed(), and a global
variable, like LocalXLogInsertAllowed, for caching the WAL prohibit state.
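
Something along these lines, perhaps (hypothetical name, and ignoring the
local caching for brevity):

    /*
     * True only when ALTER SYSTEM READ ONLY is in effect; unlike
     * XLogInsertAllowed(), this deliberately does not also report recovery.
     */
    static inline bool
    WALProhibited(void)
    {
        return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
    }

and then the autovacuum launcher would test WALProhibited() instead of
!XLogInsertAllowed().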

@@ -342,6 +344,28 @@ CheckpointerMain(void)
AbsorbSyncRequests();
HandleCheckpointerInterrupts();

+             wal_state = GetWALProhibitState();
+
+             if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+             {
+                     /* Complete WAL prohibit state change request */
+                     CompleteWALProhibitChange(wal_state);
+                     continue;
+             }
+             else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+             {
+                     /*
+                      * Don't do anything until someone wakes us up.  For example a
+                      * backend might later on request us to put the system back to
+                      * read-write wal prohibit sate.
+                      */
+                     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+                                                      WAIT_EVENT_CHECKPOINTER_MAIN);
+                     continue;
+             }
+             Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
/*
* Detect a pending checkpoint request by checking whether the flags
* word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)

return FirstCall;
}

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over? That seems very much not
practical. What am I missing?

Yes, the process doing ASRO will wait until that checkpoint is over.
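
One option might be to also check for a pending transition from the code the
checkpointer runs between buffer writes of a paced checkpoint, e.g. somewhere
in CheckpointWriteDelay() (a very rough sketch; the placement is just an
assumption):

    /* while a paced checkpoint is in progress */
    uint32      wal_state = GetWALProhibitState();

    if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
        CompleteWALProhibitChange(wal_state);

though whether it is actually safe to complete the transition in the middle of
a checkpoint (the checkpoint still has to write its own WAL record at the end)
is part of the question here.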

+/*
+ * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
+ */
+void
+send_signal_to_checkpointer(int signum)
+{
+     if (CheckpointerShmem->checkpointer_pid == 0)
+             elog(ERROR, "checkpointer is not running");
+
+     if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+             elog(ERROR, "could not signal checkpointer: %m");
+}

Sudden switch to a different naming style...

My bad, sorry, will fix that.

Regards,
Amul

1] /messages/by-id/20200724020402.2byiiufsd7pw4hsp@alap3.anarazel.de
2] /messages/by-id/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
3] /messages/by-id/CA+TgmoYMyw-m3O5XQ8tRy4mdEArGcfXr+9niO5Fmq1wVdKxYmQ@mail.gmail.com

#63Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#62)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi Andres,

The attached patch fixes the issues that you raised and that I confirmed in my
previous email. I also tried to improve some of the things that you pointed
out, but for those changes I am a little unsure and am looking forward to
input/suggestions/confirmation; therefore the 0003 patch is marked WIP.

Please have a look at my inline replies below for the things that have changed
in the attached version and need input:

On Sat, Sep 12, 2020 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Thanks for your time.

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...

On 2020-09-01 16:43:10 +0530, Amul Sul wrote:

diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..0ac826d3c2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
READ_DONE();
}
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+     READ_LOCALS(AlterSystemWALProhibitState);
+
+     READ_BOOL_FIELD(WALProhibited);
+
+     READ_DONE();
+}
+

Why do we need readfuncs support for this?

I thought we need that from your previous comment[1].

+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+     /* some code */
+     elog(INFO, "AlterSystemSetWALProhibitState() called");
+}

As long as it's not implemented it seems better to return an ERROR.

Ok, will add an error in the next version.

@@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
VariableSetStmt *setstmt; /* SET subcommand */
} AlterSystemStmt;

+/* ----------------------
+ *           Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+     NodeTag         type;
+     bool            WALProhibited;
+} AlterSystemWALProhibitState;
+

All the nearby fields use under_score_style names.

I am not sure which nearby fields having the underscore that you are referring
to. Probably "WALProhibited" needs to be renamed to "walprohibited" to be
inline with the nearby fields.

From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

1. When a user tried to change server state to WAL-Prohibited using
ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
raises request to checkpointer by marking current state to inprogress in
shared memory. Checkpointer, noticing that the current state is has

"is has"

WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
then acknowledges back to the backend who requested the state change once
the transition has been completed. Final state will be updated in control
file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?

Once we've initiated the change to a read-only state, we probably want to
always either finish that change or go back to read-write, even if the process
that initiated the change is interrupted. Leaving the system in a
half-way-in-between state long term seems bad. Maybe we would have put some
background process, but choose the checkpointer in charge of making the state
change and to avoid the new background process to keep the first version patch
simple. The checkpointer isn't likely to get killed, but if it does, it will
be relaunched and the new one can clean things up. Probably later we might want
such a background worker that will be isn't likely to get killed.

2. When a backend receives the WAL-Prohibited barrier, at that moment if
it is already in a transaction and the transaction already assigned XID,
then the backend will be killed by throwing FATAL(XXX: need more discussion
on this)

3. Otherwise, if that backend running transaction which yet to get XID
assigned we don't need to do anything special

Somewhat garbled sentence...

4. A new transaction (from existing or new backend) starts as a read-only
transaction.

Maybe "(in an existing or in a new backend)"?

5. Autovacuum launcher as well as checkpointer will don't do anything in
WAL-Prohibited server state until someone wakes us up. E.g. a backend
might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"

Ok, I'll fix all of this. I usually don't much focus on the commit message text
but I try to make it as much as possible sane enough.

6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
and xlog rotation. Starting up again will perform crash recovery(XXX:
need some discussion on this as well) but the end of recovery checkpoint
will be skipped and it will be performed when the system changed to
WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.

I would be so thankful for the help.

8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;

Ok, currently we don't have SQL callable function to change the system
read-write state. Do you want me to add that? If so, any naming suggesting? How
about pg_make_system_read_only(bool) or have two function as
pg_make_system_read_only(void) & pg_make_system_read_write(void).

In the attached version I added an SQL-callable function,
pg_alter_wal_prohibit_state(bool); other naming suggestions are welcome.

For the permission-denied error for ALTER SYSTEM READ ONLY/READ WRITE, I have
added an ereport() in AlterSystemSetWALProhibitState() instead of
aclcheck_error(), and a hint is added. Any suggestions?
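
Following the pg_promote() precedent quoted above, I assume the default
privilege would then be handled in system_views.sql with something like:

REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(boolean) FROM public;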

9. Add system_is_read_only GUC show the system state -- will true when system
is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay. ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.

Well, whether the system is in recovery or wal prohibited state it is read-only
for the user perspective, isn't it?

--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,321 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ *           PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+     /* Indicates current WAL prohibit state */
+     pg_atomic_uint32 SharedWALProhibitState;
+
+     /* Startup checkpoint pending */
+     bool            checkpointPending;
+
+     /* Signaled when requested WAL prohibit state changes */
+     ConditionVariable walprohibit_cv;

You're using three different naming styles for as many members.

Ill fix in the next version.

+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+     /*
+      * Kill off any transactions that have an XID *before* allowing the system
+      * to go WAL prohibit state.
+      */
+     if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))

Hm. I wonder if this check is good enough. If you look at
RecordTransactionCommit() we also WAL log in some cases where no xid was
assigned. This is particularly true of (auto-)vacuum, but also for HOT
pruning.

I think it'd be good to put the logic of this check into xlog.c and
mirror the logic in RecordTransactionCommit(). And add cross-referencing
comments to RecordTransactionCommit and the new function, reminding our
futures selves that both places need to be modified.

I am not sure I have understood this, here is the snip from the implementation
detail from the first post[2]:

"Open transactions that don't have an XID are not killed, but will get an ERROR
if they try to acquire an XID later, or if they try to write WAL without
acquiring an XID (e.g. VACUUM). To make that happen, the patch adds a new
coding rule: a critical section that will write WAL must be preceded by a call
to CheckWALPermitted(), AssertWALPermitted(), or AssertWALPermitted_HaveXID().
The latter variants are used when we know for certain that inserting WAL here
must be OK, either because we have an XID (we would have been killed by a change
to read-only if one had occurred) or for some other reason."

Do let me know if you want further clarification.

+     {
+             /* Should be here only for the WAL prohibit state. */
+             Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);

There are no races where an ASRO READ ONLY is quickly followed by ASRO
READ WRITE where this could be reached?

No, right now SetWALProhibitState() doesn't allow two transient wal prohibit
states at a time.

+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+     uint32          state;
+
+     if (!superuser())
+             ereport(ERROR,
+                             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                              errmsg("must be superuser to execute ALTER SYSTEM command")));

See comments about this above.

+     /* Alter WAL prohibit state not allowed during recovery */
+     PreventCommandDuringRecovery("ALTER SYSTEM");
+
+     /* Requested state */
+     state = stmt->WALProhibited ?
+             WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+     /*
+      * Since we yet to convey this WAL prohibit state to all backend mark it
+      * in-progress.
+      */
+     state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+     if (!SetWALProhibitState(state))
+             return;                                 /* server is already in the desired state */
+

This use of bitmasks seems unnecessary to me. I'd rather have one param
for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
for WALPROHIBIT_TRANSITION_IN_PROGRESS

Ok.

How about the new version of SetWALProhibitState function as :
SetWALProhibitState(bool wal_prohibited, bool is_final_state) ?

I have added the same.

+/*
+ * RequestWALProhibitChange()
+ *
+ * Request checkpointer to make the WALProhibitState to read-only.
+ */
+static void
+RequestWALProhibitChange(void)
+{
+     /* Must not be called from checkpointer */
+     Assert(!AmCheckpointerProcess());
+     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+     /*
+      * If in a standalone backend, just do it ourselves.
+      */
+     if (!IsPostmasterEnvironment)
+     {
+             CompleteWALProhibitChange(GetWALProhibitState());
+             return;
+     }
+
+     send_signal_to_checkpointer(SIGINT);
+
+     /* Wait for the state to change to read-only */
+     ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+     for (;;)
+     {
+             /* We'll be done once in-progress flag bit is cleared */
+             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+                     break;
+
+             ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+                                                        WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+     }
+     ConditionVariableCancelSleep();

What if somebody concurrently changes the state back to READ WRITE?
Won't we unnecessarily wait here?

Yes, there will be wait.

That's probably fine, because we would just wait until that transition
is complete too. But at least a comment about that would be
good. Alternatively a "ASRO transitions completed counter" or such might
be a better idea?

Ok, will add comments but could you please elaborate little a bit about "ASRO
transitions completed counter" and is there any existing counter I can refer
to?

+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 wal_state)
+{
+     uint64          barrierGeneration;
+
+     /*
+      * Must be called from checkpointer. Otherwise, it must be single-user
+      * backend.
+      */
+     Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+     Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+     /*
+      * WAL prohibit state change is initiated. We need to complete the state
+      * transition by setting requested WAL prohibit state in all backends.
+      */
+     elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+     /* Emit global barrier */
+     barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+     WaitForProcSignalBarrier(barrierGeneration);
+
+     /* And flush all writes. */
+     XLogFlush(GetXLogWriteRecPtr());

Hm, maybe I'm missing something, but why is the write pointer the right
thing to flush? That won't include records that haven't been written to
disk yet... We also need to trigger writing out all WAL that is as of
yet unwritten, no? Without having thought a lot about it, it seems that
GetXLogInsertRecPtr() would be the right thing to flush?

TBH, I am not an expert in this area. I wants to flush the latest record
pointer that needs to be flushed, I think GetXLogInsertRecPtr() would be fine
if is the latest one. Note that wal flushes are not blocked in read-only mode.

Used GetXLogInsertRecPtr().
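
I.e. the flush there becomes, roughly:

    /* And flush everything inserted so far, whether or not already written. */
    XLogFlush(GetXLogInsertRecPtr());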

+     /* Set final state by clearing in-progress flag bit */
+     if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
+     {
+             bool            wal_prohibited;
+
+             wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
+
+             /* Update the control file to make state persistent */
+             SetControlFileWALProhibitFlag(wal_prohibited);

Hm. Is there an issue with not WAL logging the control file change? Is
there a scenario where we a crash + recovery would end up overwriting
this?

I am not sure. If the system crash before update this that means we haven't
acknowledged the system state change. And the server will be restarted with the
previous state.

Could you please explain what bothering you.

+             if (wal_prohibited)
+                     ereport(LOG, (errmsg("system is now read only")));
+             else
+             {
+                     /*
+                      * Request checkpoint if the end-of-recovery checkpoint has been
+                      * skipped previously.
+                      */
+                     if (WALProhibitState->checkpointPending)
+                     {
+                             RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+                                                               CHECKPOINT_IMMEDIATE);
+                             WALProhibitState->checkpointPending = false;
+                     }
+                     ereport(LOG, (errmsg("system is now read write")));
+             }
+     }
+
+     /* Wake up the backend who requested the state change */
+     ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);

Could be multiple backends, right?

Yes, you are correct, will fix that.

+}
+
+/*
+ * GetWALProhibitState()
+ *
+ * Atomically return the current server WAL prohibited state
+ */
+uint32
+GetWALProhibitState(void)
+{
+     return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
+}

Is there an issue with needing memory barriers here?

+/*
+ * SetWALProhibitState()
+ *
+ * Change current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+     bool            state_updated = false;
+     uint32          cur_state;
+
+     cur_state = GetWALProhibitState();
+
+     /* Server is already in requested state */
+     if (new_state == cur_state ||
+             new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+             return false;
+
+     /* Prevent concurrent contrary in progress transition state setting */
+     if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+             (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+     {
+             if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("system state transition to read only is already in progress"),
+                                      errhint("Try after sometime again.")));
+             else
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("system state transition to read write is already in progress"),
+                                      errhint("Try after sometime again.")));
+     }
+
+     /* Update new state in share memory */
+     state_updated =
+             pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
+                                                                        &cur_state, new_state);
+
+     if (!state_updated)
+             ereport(ERROR,
+                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                              errmsg("system read write state concurrently changed"),
+                              errhint("Try after sometime again.")));
+

I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
of a loop. I think there's platforms (basically all load-linked /
store-conditional architectures) where than can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.

How about having some kind of lock instead what Robert have suggested
previously[3] ?

I would like to discuss this point more. In the attached version I have added
WALProhibitLock to protect updates of the shared walprohibit state. I was a
little unsure whether we also want a spinlock like the one XLogCtlData has,
which is used both for reading the shared variables and for updates,
e.g. for LogwrtResult.

Right now I haven't added one, and the shared walprohibit state is fetched
through a volatile pointer. Do we need a spinlock there? I am not sure why we
would. Thoughts?
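
Conceptually the update path is now something like this (simplified; not an
exact excerpt from the attached patch):

    /* writers serialize on WALProhibitLock */
    LWLockAcquire(WALProhibitLock, LW_EXCLUSIVE);

    cur_state = WALProhibitState->SharedWALProhibitState;
    /* ... validate the requested transition against cur_state ... */
    WALProhibitState->SharedWALProhibitState = new_state;

    LWLockRelease(WALProhibitLock);

while readers just load the value without taking the lock.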

+/*
+ * MarkCheckPointSkippedInWalProhibitState()
+ *
+ * Sets checkpoint pending flag so that it can be performed next time while
+ * changing system state to WAL permitted.
+ */
+void
+MarkCheckPointSkippedInWalProhibitState(void)
+{
+     WALProhibitState->checkpointPending = true;
+}

I don't *at all* like this living outside of xlog.c. I think this should
be moved there, and merged with deferring checkpoints in other cases
(promotions, not immediately performing a checkpoint after recovery).

Here we want to perform the checkpoint sometime quite later when the
system state changes to read-write. For that, I think we need some flag
if we want this in xlog.c then we can have that flag in XLogCtl.

Right now I have added a new variable to XLogCtlData and moved this code to
xlog.c.
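
Roughly (the member name here is just what I'd sketch, not necessarily what
ended up in the attached patch):

    /* new member in XLogCtlData */
    bool        walprohibitCheckpointPending;

    /* in the read-write transition path, once WAL writes are allowed again */
    if (XLogCtl->walprohibitCheckpointPending)
    {
        RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
        XLogCtl->walprohibitCheckpointPending = false;
    }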

There's state in ControlFile *and* here for essentially the same thing.

I am sorry to trouble you much, but I haven't understood this too.

+      * If it is not currently possible to insert write-ahead log records,
+      * either because we are still in recovery or because ALTER SYSTEM READ
+      * ONLY has been executed, force this to be a read-only transaction.
+      * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+      * us from modifying data during recovery when !XLogInsertAllowed(), but
+      * this gives the normal indication to the user that the transaction is
+      * read-only.
+      *
+      * On the other hand, we only need to set the startedInRecovery flag when
+      * the transaction started during recovery, and not when WAL is otherwise
+      * prohibited. This information is used by RelationGetIndexScan() to
+      * decide whether to permit (1) relying on existing killed-tuple markings
+      * and (2) further killing of index tuples. Even when WAL is prohibited
+      * on the master, it's still the master, so the former is OK; and since
+      * killing index tuples doesn't generate WAL, the latter is also OK.
+      * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+      */
+     XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+     s->startedInRecovery = RecoveryInProgress();

It's somewhat ugly that we call RecoveryInProgress() once in
XLogInsertAllowed() and then again directly here... It's probably fine
runtime cost wise, but...

/*
* Subroutine to try to fetch and validate a prior checkpoint record.
*
@@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
*/
WalSndWaitStopping();

+     /*
+      * The restartpoint, checkpoint, or xlog rotation will be performed if the
+      * WAL writing is permitted.
+      */
if (RecoveryInProgress())
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-     else
+     else if (XLogInsertAllowed())

Not sure I like going via XLogInsertAllowed(), that seems like a
confusing indirection here. And it encompasses things we atually don't
want to check for - it's fragile to also look at LocalXLogInsertAllowed
here imo.

ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1b8cd7bacd4..aa4cdd57ec1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])

HandleAutoVacLauncherInterrupts();

+             /* If the server is read only just go back to sleep. */
+             if (!XLogInsertAllowed())
+                     continue;
+

I think we really should have a different functions for places like
this. We don't want to generally hide bugs like e.g. starting the
autovac launcher in recovery, but this would.

So, we need a separate function like XLogInsertAllowed() and a global variable
like LocalXLogInsertAllowed for the caching wal prohibit state.

@@ -342,6 +344,28 @@ CheckpointerMain(void)
AbsorbSyncRequests();
HandleCheckpointerInterrupts();

+             wal_state = GetWALProhibitState();
+
+             if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+             {
+                     /* Complete WAL prohibit state change request */
+                     CompleteWALProhibitChange(wal_state);
+                     continue;
+             }
+             else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+             {
+                     /*
+                      * Don't do anything until someone wakes us up.  For example a
+                      * backend might later on request us to put the system back to
+                      * read-write wal prohibit state.
+                      */
+                     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+                                                      WAIT_EVENT_CHECKPOINTER_MAIN);
+                     continue;
+             }
+             Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
/*
* Detect a pending checkpoint request by checking whether the flags
* word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)

return FirstCall;
}

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over? That seems very much not
practical. What am I missing?

Yes, the process doing ASRO will wait until that checkpoint is over.

+/*
+ * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
+ */
+void
+send_signal_to_checkpointer(int signum)
+{
+     if (CheckpointerShmem->checkpointer_pid == 0)
+             elog(ERROR, "checkpointer is not running");
+
+     if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+             elog(ERROR, "could not signal checkpointer: %m");
+}

Sudden switch to a different naming style...

My bad, sorry, will fix that.
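
i.e. the same function, just renamed to follow the surrounding CamelCase
convention (the final name is of course open):

    void
    SendSignalToCheckpointer(int signum)
    {
        if (CheckpointerShmem->checkpointer_pid == 0)
            elog(ERROR, "checkpointer is not running");

        if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
            elog(ERROR, "could not signal checkpointer: %m");
    }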

1] /messages/by-id/20200724020402.2byiiufsd7pw4hsp@alap3.anarazel.de
2] /messages/by-id/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
3] /messages/by-id/CA+TgmoYMyw-m3O5XQ8tRy4mdEArGcfXr+9niO5Fmq1wVdKxYmQ@mail.gmail.com

Thank you!

Regards,
Amul

Attachments:

v7-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/x-patch)
From ac524d266aae75614da81e5868dd78f8bbb4c9db Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v7 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, an Assert or an ERROR is added before WAL
writes while the system is WAL-prohibited:

 - An ERROR is raised in functions that can be reached without a valid XID,
   e.g. from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - An Assert is used in functions that cannot be reached without a valid XID,
   since XID assignment has already verified that WAL writes are permitted.
   For that, AssertWALPermittedHaveXID() is added.

To enforce the rule that one of these checks precedes every critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will trip an assertion when called
inside a critical section.

If the WAL insert does not happen inside a critical section, the check before
the critical section is unnecessary; we can rely on XLogBeginInsert() to
perform the check and report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 +++++--
 src/backend/access/heap/vacuumlazy.c      | 18 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 +++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++++++----
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 27 ++++++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++--
 src/backend/commands/sequence.c           | 16 ++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  4 +--
 src/include/miscadmin.h                   | 27 ++++++++++++++++
 41 files changed, 465 insertions(+), 73 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index eb96b4bb36d..53d8c9cea28 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 82788a5c367..f31590dcd75 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e53..d91ca2b391c 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9b5f417eac4..a411dbb128c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2174,6 +2177,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2692,6 +2697,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3444,6 +3451,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3617,6 +3626,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4550,6 +4561,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5341,6 +5354,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5499,6 +5514,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5607,6 +5624,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5723,6 +5742,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
@@ -5753,6 +5773,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5763,7 +5787,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index bc510e2e9b3..9dcae7d2153 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..1869df5f03f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1482,6 +1487,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1499,7 +1507,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1932,6 +1940,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1939,6 +1948,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1964,7 +1976,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never runs in the WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c87..2c3d8aaecbd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1898,13 +1901,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7f392480ac0..8c3fc251a29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1832,6 +1849,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1920,6 +1938,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1971,7 +1993,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2064,6 +2086,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2277,6 +2300,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2356,7 +2383,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a4..0a88740764f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2938,7 +2941,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f9981e35..ff2bc8cc74b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index fc6d6c975f4..d1257d201a0 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -24,6 +24,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the WAL insert permission check rule before starting
+ * a critical section for WAL writes.  For this, one of CheckWALPermitted,
+ * AssertWALPermittedHaveXID, or AssertWALPermitted must be called before
+ * starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 188c299bed9..abda095e735 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a192e01211..974bb36c51b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1024,7 +1024,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2859,9 +2859,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise, dirty buffers
+	 * could not be evicted until WAL is flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8818,6 +8820,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8847,6 +8851,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9075,6 +9081,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9232,6 +9240,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9878,7 +9888,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9892,10 +9902,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9917,8 +9927,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assert that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c526bb19281..506d7e97f38 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited ERROR would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..8dacf48db24 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 97d8a62fd06..3ff6b954f1b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -944,6 +944,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery, but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b4..186cc47be1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3638,13 +3638,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL writes are currently disallowed, don't
+			 * dirty the page.  We can set the hint, but must not dirty the
+			 * page as a result, lest we trigger WAL generation.  Unless the
+			 * page is dirtied again later, the hint will be lost when the
+			 * page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 9d308a11cd0..574a4eba1a4 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -24,8 +24,8 @@ extern void MarkCheckPointSkippedInWalProhibitState(void);
 extern void WALProhibitStateShmemInit(void);
 
 /* WAL Prohibit States */
-#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
-#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000	/* WAL permitted */
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001	/* WAL prohibited */
 
 /*
  * The bit is used in state transition from one state to another.  When this
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0
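
To make the new coding rule easier to follow, here is a minimal sketch (not
part of the patch) of the pattern the SP-GiST and FSM hunks above apply:
evaluate RelationNeedsWAL() once, call CheckWALPermitted() before entering the
critical section, and reuse the cached result for the WAL insertion itself.
The wal_log_full_page() helper and its "modify the page" step are hypothetical;
RelationNeedsWAL(), CheckWALPermitted(), log_newpage_buffer() and the critical
section macros are the interfaces actually used in the hunks above.

    #include "postgres.h"

    #include "access/walprohibit.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Hypothetical helper showing the check-before-critical-section rule. */
    static void
    wal_log_full_page(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /* Raise ERROR here, not PANIC later, if the system is read only. */
        if (needwal)
            CheckWALPermitted();

        START_CRIT_SECTION();

        /* ... modify the page held in buf ... */
        MarkBufferDirty(buf);

        /* Reuse the decision taken before the critical section. */
        if (needwal)
            log_newpage_buffer(buf, true);

        END_CRIT_SECTION();
    }

Caching needwal up front keeps the permission check and the actual WAL
insertion decision consistent, which is why the hunks above replace the second
RelationNeedsWAL() call with the cached flag.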

Attachment: v7-0003-WIP-Implement-ALTER-SYSTEM-READ-ONLY-using-global.patch (application/x-patch)
From fcd2b4368f7b37aba497c8a314a2929b6c31e7d0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v7 3/5] WIP - Implement ALTER SYSTEM READ ONLY using global
 barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited using the
    ALTER SYSTEM READ ONLY command, AlterSystemSetWALProhibitState() raises a
    request to the checkpointer by marking the current state as in-progress in
    shared memory.  The checkpointer, noticing that the current state has the
    WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, emits the barrier request and
    then acknowledges back to the backend that requested the state change once
    the transition has been completed.  The final state is updated in the
    control file to make it persistent across system restarts.

 2. When a backend receives the WAL-prohibited barrier, if it is already in a
    transaction and that transaction has already been assigned an XID, the
    backend will be killed by throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special needs to be done right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the local read-only state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher, as well as the checkpointer, will not do
    anything while in the WAL-prohibited server state until someone wakes
    them up.  E.g. the user might later request to put the system back to
    read-write by executing ALTER SYSTEM READ WRITE.

 6. At shutdown in WAL-prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    needs some discussion as well), but the end-of-recovery checkpoint will
    be skipped and performed later, once the system is changed back to
    WAL-permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. To execute ALTER SYSTEM READ ONLY/WRITE, the user must have EXECUTE
    permission on the pg_alter_wal_prohibit_state() function.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL-prohibited or in recovery.

======
TODOs
======
 1. TBC, WALProhibitLock is added to protect the shared walprohibit state; is
    that correct, or is it overkill?

 2. TBC, not using any lock in GetWALProhibitState() to read the state.
    Since the shared wal prohibit state can be updated only while holding
    WALProhibitLock exclusively, reading via a volatile pointer should be
    enough?

 3. TBC, the name picked for the SQL-callable function is
    pg_alter_wal_prohibit_state(); it could be better.

 4. TBC, Error message and hint added in AlterSystemSetWALProhibitState().

 5. TBC, SetWALProhibitState() parameter is changed as per Andres' suggestion.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 350 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +--
 src/backend/access/transam/xlog.c        | 114 +++++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  37 +++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  26 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/tcop/utility.c               |  15 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  83 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 662 insertions(+), 69 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..fc6d6c975f4
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,350 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Indicates current WAL prohibit state.
+	 * Update protected by WALProhibitLock.
+	 */
+	uint32		shared_walprohibit_state;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/* We should only get here for the WAL prohibit state. */
+		Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to challenges presented by the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * subtransaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* Check permission for pg_alter_wal_prohibit_state() */
+	if (pg_proc_aclcheck(F_PG_ALTER_WAL_PROHIBIT_STATE,
+						 GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied for command ALTER SYSTEM"),
+				 errhint("Grant EXECUTE permission on pg_alter_wal_prohibit_state() to this user.")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Execute function to alter wal prohibit state */
+	(void) OidFunctionCall1(F_PG_ALTER_WAL_PROHIBIT_STATE,
+							BoolGetDatum(stmt->walprohibited));
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL-callable function to alter the system WAL prohibit state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * This is not the final state, since we have yet to convey this WAL
+	 * prohibit state to all backends.
+	 */
+	if (!SetWALProhibitState(walprohibited, false))
+	{
+		/* Server is already in the desired state */
+		PG_RETURN_BOOL(false);
+	}
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange();
+
+	/* Server state changed to the desired state */
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Request the checkpointer to complete the requested WAL prohibit state change.
+ */
+static void
+RequestWALProhibitChange(void)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(GetWALProhibitState());
+		return;
+	}
+
+	SendsSignalToCheckpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once in-progress flag bit is cleared */
+		if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 wal_state)
+{
+	uint64		barrierGeneration;
+	bool		wal_prohibited;
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * A WAL prohibit state change has been initiated.  We need to complete the
+	 * state transition by setting the requested state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
+
+	/* Emit global barrier */
+	barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrierGeneration);
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+	wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Set the final state */
+	if (SetWALProhibitState(wal_prohibited, true))
+	{
+		/* Update the control file to make state persistent */
+		SetControlFileWALProhibitFlag(wal_prohibited);
+
+		if (wal_prohibited)
+			ereport(LOG, (errmsg("system is now read only")));
+		else
+		{
+			/*
+			 * Request checkpoint if the end-of-recovery checkpoint has been
+			 * skipped previously.
+			 */
+			if (LastCheckPointIsSkipped())
+			{
+				RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+				SetLastCheckPointSkipped(false);
+			}
+			ereport(LOG, (errmsg("system is now read write")));
+		}
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * GetWALProhibitState()
+ *
+ * Return the current server WAL prohibited state
+ */
+uint32
+GetWALProhibitState(void)
+{
+	/*
+	 * Use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile WALProhibitStateData *cur_state = WALProhibitState;
+
+	return cur_state->shared_walprohibit_state;
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Change current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is the same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ *
+ */
+bool
+SetWALProhibitState(bool wal_prohibited, bool is_final_state)
+{
+	uint32		new_state;
+	uint32		cur_state;
+
+	/*
+	 * Only the checkpointer, the startup process, or a single-user backend can
+	 * set the final WAL prohibit state.
+	 */
+	Assert(!is_final_state ||  AmCheckpointerProcess() || AmStartupProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/* Compute new state */
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	if (!is_final_state)
+		new_state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
+
+	LWLockAcquire(WALProhibitLock, LW_EXCLUSIVE);
+
+	/* Get the current state */
+	cur_state = GetWALProhibitState();
+
+	/* Server is already in requested state */
+	if (new_state == cur_state ||
+		new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		LWLockRelease(WALProhibitLock);
+		return false;
+	}
+
+	/* Prevent a contrary state transition while another one is in progress */
+	if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+		(cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+	{
+		if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read only is already in progress"),
+					 errhint("Try again later.")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("system state transition to read write is already in progress"),
+					 errhint("Try again later.")));
+	}
+
+	/* Update the new state in shared memory */
+	WALProhibitState->shared_walprohibit_state = new_state;
+
+	LWLockRelease(WALProhibitLock);
+
+	return true;
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb13..188c299bed9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a38371a64f9..6a192e01211 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -245,9 +246,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -722,6 +724,11 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastCheckPointSkipped indicates whether the last checkpoint was skipped.
+	 */
+	bool		lastCheckPointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -967,6 +974,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -6195,6 +6203,32 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or unset the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetLastCheckPointSkipped(bool ChkptSkip)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->lastCheckPointSkipped = ChkptSkip;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Return value of lastCheckPointSkipped flag.
+ */
+bool
+LastCheckPointIsSkipped(void)
+{
+	bool	ChkptSkipped;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ChkptSkipped = XLogCtl->lastCheckPointSkipped;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return ChkptSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7703,6 +7737,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, update the WAL prohibit state in shared
+	 * memory; it decides whether further WAL inserts are allowed.
+	 */
+	(void) SetWALProhibitState(ControlFile->wal_prohibited, true);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7713,7 +7753,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetLastCheckPointSkipped(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7959,6 +8009,25 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8174,9 +8243,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8195,9 +8264,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8219,6 +8299,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8508,9 +8594,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint, checkpoint, or xlog rotation will be performed only if
+	 * WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8523,6 +8613,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ed4f3f142d8..79da249dd5c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1485,6 +1485,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 19ba26b914e..d40d2bce7f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -658,6 +658,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c96568149fe..3a8a40f7f3b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -282,7 +282,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 45f5deca72e..97d8a62fd06 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -346,6 +347,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -356,6 +358,28 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state = GetWALProhibitState();
+
+		if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state);
+			continue;
+		}
+		else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later request us to put the system back into the
+			 * read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1337,3 +1361,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendsSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+void
+SendsSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836a..95a738d7f25 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4060,6 +4060,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd6..2d000ec2ff7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 63cb70bcaa4..4f1b67f9d04 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd942..6824ce0aa3f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock					44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock				46
 NotifyQueueTailLock					47
+WALProhibitLock						48
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 74c2162cd59..05eac206182 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3691,16 +3691,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b842..24113249f67 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12041,4 +12055,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..9d308a11cd0
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,83 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void CompleteWALProhibitChange(uint32 wal_state);
+extern uint32 GetWALProhibitState(void);
+extern bool SetWALProhibitState(bool wal_prohibited, bool is_final_state);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateShmemInit(void);
+
+/* WAL Prohibit States */
+#define	WALPROHIBIT_STATE_READ_WRITE		0x0000
+#define	WALPROHIBIT_STATE_READ_ONLY			0x0001
+
+/*
+ * This bit is used during the transition from one state to another.  When this
+ * bit is set, the state indicated by the 0th position bit is yet to be
+ * confirmed.
+ */
+#define WALPROHIBIT_TRANSITION_IN_PROGRESS	0x0002
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..2bcd37894f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetLastCheckPointSkipped(bool ChkptSkip);
+extern bool LastCheckPointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* WAL prohibited determines if the WAL insert is allowed or not. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 687509ba926..246a9de91e3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10964,6 +10964,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4142', descr => 'alter system read only state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'bool',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4b..f9ff2360b35 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -956,6 +956,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..ad5e3ba5724 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void SendsSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 74af9665d00..f16efeb5d6a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2670,6 +2670,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.22.0

v7-0002-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From 0797fd2e9adf52066b605dafc14a707962611165 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v7 2/5] Add alter system read only/write syntax

Note that the syntax doesn't have any implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 21 +++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0409a40b82a..b3c055a4b7e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(walprohibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5405,6 +5414,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e2d1b987bf4..5f5f289b8af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(walprohibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3457,6 +3463,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515da..4b98ed7f122 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1358,6 +1358,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(walprohibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3914,6 +3923,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..beb6540ecb9 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(walprohibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2874,6 +2887,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9f47745ee24..c900e5d8319 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10176,8 +10177,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->walprohibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9a35147b26a..74c2162cd59 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2819,6 +2826,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3683,3 +3691,16 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index f41785f11c1..408f6260b26 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1864,9 +1864,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e83329fd6d1..2ecaeff031b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		walprohibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1afb345c36..74af9665d00 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0

v7-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/x-patch)
From 9b0537dcece5e12cffa8709494e924790b24822c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v7 1/5] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index ffe67acea1c..63cb70bcaa4 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0

v7-0005-WIP-Documentation.patch (application/x-patch)
From 4b18e2a3c598b9abd040c2ba1ba844b012f0c3a0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v7 5/5] WIP - Documentation.

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by ALTER SYSTEM READ
+ONLY.  We have a lower-level defense in XLogBeginInsert() and elsewhere that
+stops us from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error; otherwise, it will cause a PANIC as mentioned previously.
+
+During recovery we never reach the point where we try to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend which receives the read-only state transition barrier interrupt
+needs to stop WAL writing immediately.  When absorbing the barrier, a backend
+kills any running transaction that has a valid XID, since a valid XID indicates
+that the transaction has performed, or is planning, a WAL write.  A transaction
+that has not acquired an XID yet, or an operation such as VACUUM or CREATE
+INDEX CONCURRENTLY that does not necessarily have a valid XID when writing WAL,
+is not stopped during barrier processing, and might instead hit the error from
+XLogBeginInsert() when it tries to write WAL in the read-only state.  To prevent
+that error from being raised inside a critical section, WAL write permission has
+to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assert-only flag that records whether
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is not inside a critical section, where throwing
+an error is acceptable.  To set the permission-check flag, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for placing these permission
+check routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can be checked using AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want to ensure on assert-enabled
+	builds that the permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0

#64Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#60)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Sep 8, 2020 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:

This pattern seems like it'll get unwieldy with more than one barrier
type. And won't flag "unhandled" barrier types either (already the case,
I know). We could go for something like:

while (flags != 0)
{
barrier_bit = pg_rightmost_one_pos32(flags);
barrier_type = 1 >> barrier_bit;

switch (barrier_type)
{
case PROCSIGNAL_BARRIER_PLACEHOLDER:
processed = ProcessBarrierPlaceholder();
}

if (processed)
BARRIER_CLEAR_BIT(flags, barrier_type);
}

But perhaps that's too complicated?

I don't mind a loop, but that one looks broken. We have to clear the
bit before we call the function that processes that type of barrier.
Otherwise, if we succeed in absorbing the barrier but a new instance
of the same barrier arrives meanwhile, we'll fail to realize that we
need to absorb the new one.
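
A loop with that property might look roughly like the sketch below (reusing the
BARRIER_CLEAR_BIT macro and the failure handling from the attached 0001 patch;
this is illustrative only, not code from the thread):

while (flags != 0)
{
	ProcSignalBarrierType type;
	bool		processed = true;

	type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);

	/* Per the point above: clear the bit before processing this barrier type. */
	BARRIER_CLEAR_BIT(flags, type);

	switch (type)
	{
		case PROCSIGNAL_BARRIER_PLACEHOLDER:
			processed = ProcessBarrierPlaceholder();
			break;
	}

	/* Absorption failed; put the bit back in the shared mask to retry later. */
	if (!processed)
	{
		pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
							   ((uint32) 1) << (uint32) type);
		ProcSignalBarrierPending = true;
		InterruptPending = true;
	}
}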

For this to be correct, wouldn't flags need to be volatile? Otherwise
this might use a register value for flags, which might not contain the
correct value at this point.

I think you're right.

Perhaps a comment explaining why we have to clear bits first would be
good?

Probably a good idea.

[ snipping assorted comments with which I agree ]

It might be good to add a warning to WaitForProcSignalBarrier() or by
pss_barrierCheckMask indicating that it's *not* OK to look at
pss_barrierCheckMask when checking whether barriers have been processed.

Not sure I understand this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#65Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#63)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Sep 15, 2020 at 2:35 PM Amul Sul <sulamul@gmail.com> wrote:

Hi Andres,

The attached patch fixes the issue that you raised and that I confirmed in my
previous email. I also tried to improve some of the things you pointed out, but
I am a little unsure about those changes and am looking for input, suggestions,
or confirmation on them; therefore the 0003 patch is marked WIP.

Please have a look at my inline replies below for the things that have changed
in the attached version and need input:

On Sat, Sep 12, 2020 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote:

[... Skipped ....]

+/*
+ * RequestWALProhibitChange()
+ *
+ * Request checkpointer to make the WALProhibitState to read-only.
+ */
+static void
+RequestWALProhibitChange(void)
+{
+     /* Must not be called from checkpointer */
+     Assert(!AmCheckpointerProcess());
+     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+     /*
+      * If in a standalone backend, just do it ourselves.
+      */
+     if (!IsPostmasterEnvironment)
+     {
+             CompleteWALProhibitChange(GetWALProhibitState());
+             return;
+     }
+
+     send_signal_to_checkpointer(SIGINT);
+
+     /* Wait for the state to change to read-only */
+     ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+     for (;;)
+     {
+             /* We'll be done once in-progress flag bit is cleared */
+             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+                     break;
+
+             ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+                                                        WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+     }
+     ConditionVariableCancelSleep();

What if somebody concurrently changes the state back to READ WRITE?
Won't we unnecessarily wait here?

Yes, there will be a wait.

That's probably fine, because we would just wait until that transition
is complete too. But at least a comment about that would be
good. Alternatively a "ASRO transitions completed counter" or such might
be a better idea?

OK, I will add comments, but could you please elaborate a little on the "ASRO
transitions completed counter"? Is there an existing counter I can refer
to?

In an off-list discussion, Robert explained this counter to me and why it is
needed.

I tried to add it as a "shared WAL prohibited state generation" in the attached
version. The implementation is quite similar to the generation counter in the
super barrier: when a backend requests a WAL prohibit state change, it is given
a generation number to wait on, and that wait ends when the shared generation
counter advances.
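
As a rough sketch of that shape (the "generation" field name is a placeholder
here; the condition variable and wait event are the ones from the quoted code):

/* Note the generation we must wait past. */
uint64		wait_gen = pg_atomic_read_u64(&WALProhibitState->generation);

/* ... signal the checkpointer to perform the transition ... */

ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
while (pg_atomic_read_u64(&WALProhibitState->generation) == wait_gen)
	ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
						   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
ConditionVariableCancelSleep();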

[... Skipped ....]

+/*
+ * SetWALProhibitState()
+ *
+ * Change current WAL prohibit state to the input state.
+ *
+ * If the server is already completely moved to the requested WAL prohibit
+ * state, or if the desired state is same as the current state, return false,
+ * indicating that the server state did not change. Else return true.
+ */
+bool
+SetWALProhibitState(uint32 new_state)
+{
+     bool            state_updated = false;
+     uint32          cur_state;
+
+     cur_state = GetWALProhibitState();
+
+     /* Server is already in requested state */
+     if (new_state == cur_state ||
+             new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
+             return false;
+
+     /* Prevent concurrent contrary in progress transition state setting */
+     if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+             (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
+     {
+             if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("system state transition to read only is already in progress"),
+                                      errhint("Try after sometime again.")));
+             else
+                     ereport(ERROR,
+                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                      errmsg("system state transition to read write is already in progress"),
+                                      errhint("Try after sometime again.")));
+     }
+
+     /* Update new state in share memory */
+     state_updated =
+             pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
+                                                                        &cur_state, new_state);
+
+     if (!state_updated)
+             ereport(ERROR,
+                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                              errmsg("system read write state concurrently changed"),
+                              errhint("Try after sometime again.")));
+

I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
of a loop. I think there's platforms (basically all load-linked /
store-conditional architectures) where that can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.

How about having some kind of lock instead, as Robert has suggested
previously[3]?

I would like to discuss this point more. In the attached version I added
WALProhibitLock to protect updates of the shared walprohibit state. I was a
little unsure whether we want another spinlock like the one XLogCtlData has,
which is mostly used to read shared variables, while for updates both are used,
e.g. for LogwrtResult.

Right now I haven't added one, and the shared walprohibit state is fetched
through a volatile pointer. Do we need a spinlock there? I am not sure why we
would. Thoughts?

I have reverted the WALProhibitLock implementation since, with the changes in
the attached version, I don't think we need that locking.
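
For reference, if the compare-and-swap approach were revisited, the shape Andres
describes would be a retry loop rather than a single call; a sketch only, using
the field names from the code quoted above:

uint32		cur_state = GetWALProhibitState();

/*
 * Retry the CAS until it succeeds: it can fail because the shared value
 * changed concurrently, or (per Andres's note) spuriously on some platforms.
 */
while (!pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
									   &cur_state, new_state))
{
	/* cur_state has been refreshed with the latest shared value. */
	if (cur_state == new_state)
		break;					/* someone else already made the change */

	/* ... re-run the state sanity checks against cur_state here ... */
}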

Regards,
Amul

Attachments:

v8-0002-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From f1b4606aac0b28d22366e79a5e1168355d0a9589 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v8 2/5] Add alter system read only/write syntax

Note that the syntax doesn't have any implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 21 +++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0409a40b82a..b3c055a4b7e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(walprohibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5405,6 +5414,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e2d1b987bf4..5f5f289b8af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(walprohibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3457,6 +3463,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515da..4b98ed7f122 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1358,6 +1358,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(walprohibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3914,6 +3923,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..beb6540ecb9 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(walprohibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2874,6 +2887,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 17653ef3a79..a0514e66c4a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10173,8 +10174,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->walprohibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9a35147b26a..74c2162cd59 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2819,6 +2826,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3683,3 +3691,16 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9c6f5ecb6a8..d28d8cea773 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1864,9 +1864,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4f..17d6942c734 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -412,6 +412,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 60c2f454660..340ee87f1bc 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3196,6 +3196,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		walprohibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1afb345c36..74af9665d00 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0

v8-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/x-patch)
From 57458536d9dbfe21fb68a30aa43f7d2c12c13a80 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v8 1/5] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index ffe67acea1c..63cb70bcaa4 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0

v8-0005-WIP-Documentation.patch (application/x-patch)
From 3a3fe80a0fe513411248ed4eaa3322ada3c2c41b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v8 5/5] WIP - Documentation.

TODOs:

1] TBC, the section for pg_is_in_readonly() function, right now it is under "Recovery Information Functions"
2] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by ALTER SYSTEM READ
+ONLY.  We have a lower-level defense in XLogBeginInsert() and elsewhere that
+stops us from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error; otherwise, it will cause a PANIC as mentioned previously.
+
+During recovery we never reach the point where we try to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend which receives the read-only state transition barrier interrupt
+needs to stop WAL writing immediately.  When absorbing the barrier, a backend
+kills any running transaction that has a valid XID, since a valid XID indicates
+that the transaction has performed, or is planning, a WAL write.  A transaction
+that has not acquired an XID yet, or an operation such as VACUUM or CREATE
+INDEX CONCURRENTLY that does not necessarily have a valid XID when writing WAL,
+is not stopped during barrier processing, and might instead hit the error from
+XLogBeginInsert() when it tries to write WAL in the read-only state.  To prevent
+that error from being raised inside a critical section, WAL write permission has
+to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assert-only flag that records whether
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is not inside a critical section, where throwing
+an error is acceptable.  To set the permission-check flag, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for placing these permission
+check routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can be checked using AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want to ensure on assert-enabled
+	builds that the permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0

v8-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/x-patch)
From 0a3c4071145bb050b7e5e8101471a592e5639b0d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v8 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before critical sections that write WAL, based on the
following criteria, for when the system is in the WAL prohibited state:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common
   static inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies the XID.  For that, added AssertWALPermitted_HaveXID().

To enforce the rule that one of these assert or error checks appears before
entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added.  If the check is missing, XLogBeginInsert()
fails an assertion when called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 +++++--
 src/backend/access/heap/vacuumlazy.c      | 18 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 +++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++++++----
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 27 ++++++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++--
 src/backend/commands/sequence.c           | 16 ++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++
 40 files changed, 463 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index eb96b4bb36d..53d8c9cea28 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 82788a5c367..f31590dcd75 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e53..d91ca2b391c 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1585861a021..4d6052224fa 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2172,6 +2175,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2690,6 +2695,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3442,6 +3449,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3615,6 +3624,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4548,6 +4559,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5339,6 +5352,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5497,6 +5512,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5605,6 +5622,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5721,6 +5740,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5751,6 +5771,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5761,7 +5785,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index bc510e2e9b3..9dcae7d2153 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..1869df5f03f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1482,6 +1487,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1499,7 +1507,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1932,6 +1940,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1939,6 +1948,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1964,7 +1976,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state set,
+	 * so skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c87..2c3d8aaecbd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1898,13 +1901,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7f392480ac0..8c3fc251a29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1832,6 +1849,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1920,6 +1938,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1971,7 +1993,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2064,6 +2086,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2277,6 +2300,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2356,7 +2383,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a4..0a88740764f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2938,7 +2941,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f9981e35..ff2bc8cc74b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 75f3924cc97..f273f75b41f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -24,6 +24,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for the WAL writes.  For this, either of CheckWALPermitted,
+ * AssertWALPermittedHaveXID, or AssertWALPermitted must be called before
+ * starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 188c299bed9..abda095e735 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bfba75bbe80..253409ca065 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1025,7 +1025,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2860,9 +2860,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8826,6 +8828,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8855,6 +8859,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9083,6 +9089,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9240,6 +9248,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9886,7 +9896,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9900,10 +9910,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9925,8 +9935,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1f0e4e01e69..710806143d4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 6aab73bfd44..8dacf48db24 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e2ff484d367..a12e026ae02 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -942,6 +942,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery, but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b4..186cc47be1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3638,13 +3638,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because the system cannot currently write WAL,
+			 * don't dirty the page.  We can set the hint, but must not dirty
+			 * the page as a result, lest we trigger WAL generation.  Unless
+			 * the page is dirtied again later, the hint will be lost when
+			 * the page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0
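
A minimal sketch (not part of the patch) of how a WAL-writing code path is
expected to look under the coding rule enforced above: the WAL-permission
check is made before entering the critical section, so a WAL-prohibited
condition raises an ERROR rather than escalating to a PANIC, and the
assert-only walpermit_checked_state machinery verifies in XLogBeginInsert()
that the check was really performed whenever it runs inside a critical
section.  The function name and buffer handling below are hypothetical; only
the check-before-critical-section pattern comes from the patch (compare the
freespace.c hunk above).

    #include "access/walprohibit.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"

    /*
     * Hypothetical helper: dirty an exclusively-locked buffer and WAL-log
     * the whole page.
     */
    static void
    log_whole_page_example(Buffer buf)
    {
        /* Errors out here, outside the critical section, if WAL is prohibited. */
        CheckWALPermitted();

        START_CRIT_SECTION();

        MarkBufferDirty(buf);
        log_newpage_buffer(buf, false); /* calls XLogBeginInsert() internally */

        END_CRIT_SECTION();
    }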

v8-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch (application/x-patch)
From c0db80b3d85630a026dc722ae0abe7dd51e6ffc8 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v8 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited using
    the ALTER SYSTEM READ ONLY command or by calling the
    pg_alter_wal_prohibit_state(true) SQL function, the current state
    generation in shared memory is marked as in progress and the
    checkpointer process is signaled.  The checkpointer, noticing that the
    current state generation has the WALPROHIBIT_TRANSITION_IN_PROGRESS
    flag set, emits the barrier request and then acknowledges back to the
    backend that requested the state change once the transition has
    completed.  The final state is updated in the control file to make it
    persistent across system restarts.

 2. When a backend receives the WAL-prohibited barrier, if it is already in
    a transaction and the transaction has already been assigned an XID,
    then the backend is killed by throwing FATAL (XXX: needs more
    discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher, as well as the checkpointer, will not do
    anything while in the WAL-prohibited server state until someone wakes
    it up.  E.g. a user might later request to put the system back to
    read-write by executing ALTER SYSTEM READ WRITE.

 6. At shutdown in WAL-prohibited mode, we skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    needs some discussion as well), but the end-of-recovery checkpoint
    will be skipped; it will be performed when the system is changed back
    to WAL-permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. To execute ALTER SYSTEM READ ONLY/WRITE, the user must have EXECUTE
    permission on the pg_alter_wal_prohibit_state() function.

 9. Add a system_is_read_only GUC to show the system state -- it will be
    true when the system is WAL-prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 390 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 117 ++++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  39 +++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  26 +-
 src/backend/tcop/utility.c               |  15 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  94 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 22 files changed, 717 insertions(+), 69 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..75f3924cc97
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,390 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Indicates the current WAL prohibit state generation; the last two bits
+	 * of this generation indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 shared_state_generation;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(uint32 cur_state_gen);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit state.
+		 */
+		Assert(WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons:
+		 *
+		 * 1. Due to challenges presented by the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill
+		 * only the current subtransaction. In the case of invalidations,
+		 * that might be good enough, but for XID assignment it's not,
+		 * because assigning an XID to a subtransaction also causes higher
+		 * subtransaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* Check permission for pg_alter_wal_prohibit_state() */
+	if (pg_proc_aclcheck(F_PG_ALTER_WAL_PROHIBIT_STATE,
+						 GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied for command ALTER SYSTEM"),
+				 errhint("Grant EXECUTE on the pg_alter_wal_prohibit_state() function to this user.")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Execute function to alter wal prohibit state */
+	(void) OidFunctionCall1(F_PG_ALTER_WAL_PROHIBIT_STATE,
+							BoolGetDatum(stmt->walprohibited));
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL-callable function to alter the system's WAL prohibit state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+	uint32		cur_state_gen;
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * It is not a final state since we have yet to convey this WAL prohibit
+	 * state to all backends.
+	 */
+	cur_state_gen = SetWALProhibitState(walprohibited, false);
+
+	/* Server is already in requested state */
+	if (!cur_state_gen)
+		PG_RETURN_VOID();
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange(cur_state_gen);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Request the checkpointer to complete the requested WAL prohibit state change.
+ */
+static void
+RequestWALProhibitChange(uint32 cur_state_gen)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitStateGen() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(cur_state_gen);
+		return;
+	}
+
+	/* Signal checkpointer process */
+	SendsSignalToCheckpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once wal prohibit state generation changes */
+		if (GetWALProhibitStateGen() != cur_state_gen)
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 cur_state_gen)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * A WAL prohibit state change has been initiated. We need to complete the
+	 * transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+	wal_prohibited =
+		(WALPROHIBIT_NEXT_STATE(cur_state_gen) == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Set the final state */
+	(void) SetWALProhibitState(wal_prohibited, true);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (LastCheckPointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetLastCheckPointSkipped(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * GetWALProhibitStateGen()
+ *
+ * Atomically return the current server WAL prohibited state generation.
+ */
+uint32
+GetWALProhibitStateGen(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->shared_state_generation);
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Advances the shared WAL prohibit state generation to reflect the requested
+ * state and returns the new generation.
+ *
+ * For a transition state request (is_final_state is false): if the desired
+ * transition state is the same as the current state (some other backend may
+ * already have requested it and the transition is in progress), the current
+ * WAL prohibit generation is returned so that this backend can wait until the
+ * shared generation changes to the final state.  And, if the server has
+ * already completely moved to the requested state, the requesting backend
+ * doesn't need to wait; in that case, 0 is returned.
+ *
+ * A final state request can only be made by the checkpointer or by a
+ * single-user backend, so there is no chance that the server is already in
+ * the desired final state.
+ */
+uint32
+SetWALProhibitState(bool wal_prohibited, bool is_final_state)
+{
+	uint32		new_state;
+	uint32		cur_state;
+	uint32		cur_state_gen;
+	uint32		next_state_gen;
+
+	/* Get the current state */
+	cur_state_gen = GetWALProhibitStateGen();
+	cur_state = WALPROHIBIT_CURRENT_STATE(cur_state_gen);
+
+	/* Compute new state */
+	if (is_final_state)
+	{
+		/*
+		 * Only checkpointer or single-user can set the final wal prohibit
+		 * state.
+		 */
+		Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+		/*
+		 * No other process can be setting the final state concurrently, so
+		 * the next state from the current one must be the desired state.
+		 */
+		Assert(WALPROHIBIT_NEXT_STATE(cur_state) == new_state);
+	}
+	else
+	{
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_GOING_READ_ONLY :
+			WALPROHIBIT_STATE_GOING_READ_WRITE;
+
+		/* Server is already in the requested transition state */
+		if (cur_state == new_state)
+			return cur_state_gen;	/* Wait for state transition completion */
+
+		/* Server is already in requested state */
+		if (WALPROHIBIT_NEXT_STATE(new_state) == cur_state)
+			return 0;		/* No wait is needed */
+
+		/* Prevent a contrary state transition while another is in progress */
+		if (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again later.")));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again later.")));
+		}
+	}
+
+	/*
+	 * Update the new state generation in shared memory only if the state
+	 * generation hasn't changed since we last checked it.
+	 */
+	next_state_gen = cur_state_gen + 1;
+	(void) pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+										  &cur_state_gen, next_state_gen);
+
+	/* To be sure that any later reads of memory happen strictly after this. */
+	pg_memory_barrier();
+
+	return next_state_gen;
+}
+
+/*
+ * WALProhibitStateGenerationInit()
+ *
+ * Initialization of the shared WAL prohibit state generation.
+ */
+void
+WALProhibitStateGenerationInit(bool wal_prohibited)
+{
+	uint32	new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibitState->shared_state_generation, new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb13..188c299bed9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e26..bfba75bbe80 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -246,9 +247,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -723,6 +725,11 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastCheckPointSkipped indicates whether the last checkpoint was skipped.
+	 */
+	bool		lastCheckPointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -968,6 +975,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -6196,6 +6204,32 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or clear the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetLastCheckPointSkipped(bool ChkptSkip)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->lastCheckPointSkipped = ChkptSkip;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Return value of lastCheckPointSkipped flag.
+ */
+bool
+LastCheckPointIsSkipped(void)
+{
+	bool	ChkptSkipped;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ChkptSkipped = XLogCtl->lastCheckPointSkipped;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return ChkptSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7708,6 +7742,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateGenerationInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7718,7 +7758,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetLastCheckPointSkipped(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7964,6 +8014,28 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	uint32 		cur_state = WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8179,9 +8251,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8200,9 +8272,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8224,6 +8307,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8513,9 +8602,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is performed during recovery; a shutdown checkpoint or
+	 * xlog rotation is performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8528,6 +8621,10 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ed4f3f142d8..79da249dd5c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1485,6 +1485,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115f..efee35cbc94 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -659,6 +659,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b61..1d9c46de20a 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e7dcd4f764..e2ff484d367 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -342,6 +343,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state_gen;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -352,6 +354,30 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state_gen = GetWALProhibitStateGen();
+
+		if (wal_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state_gen);
+			continue;
+		}
+		else if (WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+				 WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later request us to put the system back into
+			 * the read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1333,3 +1359,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendsSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+void
+SendsSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836a..95a738d7f25 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4060,6 +4060,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd6..2d000ec2ff7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 63cb70bcaa4..4f1b67f9d04 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 74c2162cd59..05eac206182 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3691,16 +3691,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b842..24113249f67 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12041,4 +12055,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..61836d61844
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,94 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void CompleteWALProhibitChange(uint32 wal_state);
+extern uint32 GetWALProhibitStateGen(void);
+extern uint32 SetWALProhibitState(bool wal_prohibited, bool is_final_state);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateGenerationInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+
+/*
+ * The WAL Prohibit States.
+ *
+ * 	An odd number represents a transition state, whereas an even number
+ * 	represents a final state.  These states can be distinguished by checking
+ * 	the 0th bit, aka the transition bit.
+ */
+#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
+#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
+#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
+#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3
+
+/* The transition bit to distinguish states.  */
+#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)
+
+/* Extract last two bits */
+#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
+	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
+	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
+
+/* Never reaches when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in WAL prohibit state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..2bcd37894f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetLastCheckPointSkipped(bool ChkptSkip);
+extern bool LastCheckPointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f48f5fb4d99..41b8fe02b3e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10964,6 +10964,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4142', descr => 'alter system read only state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4b..f9ff2360b35 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -956,6 +956,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..ad5e3ba5724 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void SendsSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 74af9665d00..f16efeb5d6a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2670,6 +2670,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.22.0
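
To make the state encoding used above easier to follow (point 1 of the commit
message and the macros in walprohibit.h): the shared value is a generation
counter whose low two bits are the current state, and whose lowest bit tells
whether a transition is in progress.  Below is a small standalone
illustration, not part of the patch; the macro definitions are reproduced
from walprohibit.h so that it compiles outside the backend, and main() plus
the printed output exist only for demonstration.

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t uint32;

    /* Reproduced from src/include/access/walprohibit.h above */
    #define WALPROHIBIT_STATE_READ_WRITE        (uint32) 0  /* WAL permitted */
    #define WALPROHIBIT_STATE_GOING_READ_ONLY   (uint32) 1
    #define WALPROHIBIT_STATE_READ_ONLY         (uint32) 2  /* WAL prohibited */
    #define WALPROHIBIT_STATE_GOING_READ_WRITE  (uint32) 3
    #define WALPROHIBIT_TRANSITION_IN_PROGRESS  ((uint32) 1 << 0)
    #define WALPROHIBIT_CURRENT_STATE(stateGeneration) \
        ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
    #define WALPROHIBIT_NEXT_STATE(stateGeneration) \
        WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

    static void
    show(const char *when, uint32 gen)
    {
        uint32 state = WALPROHIBIT_CURRENT_STATE(gen);

        printf("%-30s gen=%u state=%u in_transition=%u\n", when,
               (unsigned) gen, (unsigned) state,
               (unsigned) (state & WALPROHIBIT_TRANSITION_IN_PROGRESS));
    }

    int
    main(void)
    {
        uint32 gen = WALPROHIBIT_STATE_READ_WRITE;

        show("initial (read write)", gen);

        /* A backend runs ALTER SYSTEM READ ONLY: bump to GOING_READ_ONLY. */
        gen++;
        show("after backend request", gen);

        /* The checkpointer emits the barrier, then bumps to final READ_ONLY. */
        gen++;
        show("after checkpointer completes", gen);

        return 0;
    }

Running this prints states 0 (read write), 1 (going read only, transition bit
set) and 2 (read only).  A later ALTER SYSTEM READ WRITE bumps the generation
twice more, through GOING_READ_WRITE (low bits 3) and back to READ_WRITE (low
bits 0), so the low two bits cycle 0 -> 1 -> 2 -> 3 -> 0 while the generation
itself keeps growing, which is what the requesting backend waits on.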

#66Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#65)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is a rebased version for the latest master head (#e21cbb4b893).

Regards,
Amul

Attachments:

v9-0001-Allow-error-or-refusal-while-absorbing-barriers.patch (application/octet-stream)
From 5f1ec1b1aa8d26c3d81bec9e65207861fd0dcb86 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:27:53 -0400
Subject: [PATCH v9 1/5] Allow error or refusal while absorbing barriers.

Patch by Robert Haas
---
 src/backend/storage/ipc/procsignal.c | 75 +++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index ffe67acea1c..63cb70bcaa4 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -87,12 +87,16 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+				&& ProcessBarrierPlaceholder())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, add any flags that weren't yet handled
+			 * back into pss_barrierCheckMask, and reset the global variables
+			 * so that we try again the next time we check for interrupts.
+			 */
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier was not successfully absorbed, we will have to try
+		 * again later.
+		 */
+		if (flags != 0)
+		{
+			pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+								   flags);
+			ProcSignalBarrierPending = true;
+			InterruptPending = true;
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +554,7 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static void
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +564,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.22.0
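
To make the control flow of 0001 easier to follow, here is a small stand-alone
model of the retry convention it introduces: a barrier-processing function
returns true when the barrier was absorbed (its bit is cleared) and false when
it must be retried, in which case the unhandled bits are OR'ed back into
pss_barrierCheckMask so the next CHECK_FOR_INTERRUPTS() tries again.  This is
hypothetical demo code, not the patch itself; pending_mask, DEMO_BARRIER and
absorb_demo_barrier() are invented names.

#include <stdbool.h>
#include <stdio.h>

#define BARRIER_BIT(type)               (1u << (unsigned) (type))
#define BARRIER_SHOULD_CHECK(flags, t)  (((flags) & BARRIER_BIT(t)) != 0)
#define BARRIER_CLEAR_BIT(flags, t)     ((flags) &= ~BARRIER_BIT(t))

enum { DEMO_BARRIER = 0 };

static unsigned pending_mask;	/* stands in for pss_barrierCheckMask */

/* Pretend absorption is refused on the first attempt and succeeds later. */
static bool
absorb_demo_barrier(void)
{
	static int	attempts = 0;

	return ++attempts > 1;
}

static void
process_barriers(void)
{
	unsigned	flags = pending_mask;

	pending_mask = 0;			/* like the atomic exchange to 0 */

	/* Clear the bit only if the barrier was actually absorbed. */
	if (BARRIER_SHOULD_CHECK(flags, DEMO_BARRIER) && absorb_demo_barrier())
		BARRIER_CLEAR_BIT(flags, DEMO_BARRIER);

	/* Anything left unhandled is OR'ed back so a later pass retries it. */
	if (flags != 0)
		pending_mask |= flags;
}

int
main(void)
{
	pending_mask = BARRIER_BIT(DEMO_BARRIER);

	process_barriers();
	printf("after first pass,  pending = %u\n", pending_mask);	/* 1: refused */
	process_barriers();
	printf("after second pass, pending = %u\n", pending_mask);	/* 0: absorbed */
	return 0;
}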

v9-0002-Add-alter-system-read-only-write-syntax.patchapplication/octet-stream; name=v9-0002-Add-alter-system-read-only-write-syntax.patchDownload
From fed340cae51218fa3c4e0ae0cd80e8e08f50a1f0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v9 2/5] Add alter system read only/write syntax

Note that the syntax doesn't have any implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 21 +++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0409a40b82a..b3c055a4b7e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4020,6 +4020,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(walprohibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5405,6 +5414,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e2d1b987bf4..5f5f289b8af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(walprohibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3457,6 +3463,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f0386480ab8..c8f89e6f635 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1358,6 +1358,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(walprohibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3915,6 +3924,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab7195..beb6540ecb9 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(walprohibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2874,6 +2887,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 17653ef3a79..a0514e66c4a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10173,8 +10174,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->walprohibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9a35147b26a..74c2162cd59 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2819,6 +2826,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3683,3 +3691,16 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 24c7b414cf3..9356f13e509 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1864,9 +1864,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7ddd8c011bf..7b233925692 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -411,6 +411,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 60c2f454660..340ee87f1bc 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3196,6 +3196,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		walprohibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9cd1179af63..db6f86e02ab 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.22.0
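
In short, 0002 only teaches the parser and utility layer about the new
statement: the system_readonly_state production maps READ ONLY to
walprohibited = true and READ WRITE to false, the statement is refused inside
a transaction block, and the handler itself still raises "not implemented"
until 0003 supplies the real one.  A trivial stand-alone sketch of that
mapping follows; all names here (parse_alter_system_read, the Demo struct) are
invented for illustration and are not part of the patch.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct DemoAlterSystemWALProhibitState
{
	bool		walprohibited;	/* READ ONLY => true, READ WRITE => false */
} DemoAlterSystemWALProhibitState;

/* Models the system_readonly_state grammar production. */
static DemoAlterSystemWALProhibitState
parse_alter_system_read(const char *last_keyword)
{
	DemoAlterSystemWALProhibitState n;

	n.walprohibited = (strcmp(last_keyword, "ONLY") == 0);
	return n;
}

int
main(void)
{
	DemoAlterSystemWALProhibitState ro = parse_alter_system_read("ONLY");
	DemoAlterSystemWALProhibitState rw = parse_alter_system_read("WRITE");

	printf("ALTER SYSTEM READ ONLY  -> walprohibited = %d\n", ro.walprohibited);
	printf("ALTER SYSTEM READ WRITE -> walprohibited = %d\n", rw.walprohibited);
	return 0;
}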

v9-0005-WIP-Documentation.patchapplication/octet-stream; name=v9-0005-WIP-Documentation.patchDownload
From d384897832a832b268952fb7312be1a110e96f1e Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v9 5/5] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by ALTER SYSTEM READ
+ONLY.  We have a lower-level defense in XLogBeginInsert() and elsewhere to stop
+us from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error; otherwise it will cause a PANIC, as mentioned previously.
+
+During recovery we never reach the point where we try to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only system state transition barrier
+interrupt needs to stop WAL writing immediately.  While absorbing the barrier,
+a backend kills its running transaction if it has a valid XID, since a valid
+XID indicates that the transaction has performed, or is planning, a WAL write.
+Transactions that have not yet acquired a valid XID, and operations such as
+VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have a valid XID
+when writing WAL, are not stopped during barrier processing; those might hit an
+error from XLogBeginInsert() when trying to write WAL in the read-only system
+state.  To prevent such an error from occurring inside a critical section, WAL
+write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assertion flag that records whether
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() will fail an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is not inside a critical section, since
+throwing an error there is acceptable.  To get the permission check flag set,
+either CheckWALPermitted(), AssertWALPermitted_HaveXID(), or
+AssertWALPermitted() should be called before START_CRIT_SECTION().  The flag
+is reset automatically on exit from the critical section.  The rules for
+placing the permission check routines are:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked using AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.22.0
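
The coding rule spelled out in the README hunk above reduces to the following
shape.  This is a hedged sketch, not code from the patch; it assumes the
helpers added earlier in the series (CheckWALPermitted(),
AssertWALPermitted_HaveXID()) together with the existing
START_CRIT_SECTION()/END_CRIT_SECTION() and XLogBeginInsert() from core, and
demo_wal_write() is an invented caller.

static void
demo_wal_write(bool have_xid)
{
	if (have_xid)
		AssertWALPermitted_HaveXID();	/* INSERT/UPDATE/DELETE paths: assert-only */
	else
		CheckWALPermitted();	/* e.g. VACUUM: may ERROR here, outside the critical section */

	START_CRIT_SECTION();
	XLogBeginInsert();			/* asserts that the permission flag was set above */
	/* ... XLogRegisterData()/XLogRegisterBuffer(), XLogInsert(), mark buffers dirty ... */
	END_CRIT_SECTION();
}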

v9-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchapplication/octet-stream; name=v9-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchDownload
From 86bd090ce66684c582fe88821b2e4fcafc0cf71a Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v9 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited using
    the ALTER SYSTEM READ ONLY command or by calling the
    pg_alter_wal_prohibit_state(true) SQL function, the current state
    generation in shared memory is marked as in-progress and the
    checkpointer process is signaled.  The checkpointer, noticing that the
    current state generation has the WALPROHIBIT_TRANSITION_IN_PROGRESS
    flag set, emits the barrier request, and then acknowledges the backend
    that requested the state change once the transition has been completed.
    The final state is also updated in the control file to make it
    persistent across system restarts.

 2. When a backend receives the WAL-prohibited barrier, if it is already in
    a transaction and the transaction has already been assigned an XID,
    then the backend will be killed by throwing FATAL (XXX: needs more
    discussion on this).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher, as well as the checkpointer, will not do
    anything while in the WAL-prohibited server state until someone wakes
    it up.  E.g. a user might, later on, request us to put the system back
    to read-write by executing ALTER SYSTEM READ WRITE.

 6. At shutdown in WAL-prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    needs some discussion as well), but the end-of-recovery checkpoint will
    be skipped; it will be performed when the system is changed back to
    WAL-permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. To execute ALTER SYSTEM READ ONLY/WRITE, the user should have execute
    permission on the pg_alter_wal_prohibit_state() function.

 9. Add a system_is_read_only GUC to show the system state -- it will be
    true when the system is WAL-prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 390 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 116 ++++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  39 +++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  26 +-
 src/backend/tcop/utility.c               |  15 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  94 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 22 files changed, 716 insertions(+), 69 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..00c8894d806
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,390 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Indicates the current WAL prohibit state generation; the last two bits
+	 * of this generation indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 shared_state_generation;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(uint32 cur_state_gen);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit state.
+		 */
+		Assert(WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need some thought:
+		 *
+		 * 1. Due to present challenges with the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* Check permission for pg_alter_wal_prohibit_state() */
+	if (pg_proc_aclcheck(F_PG_ALTER_WAL_PROHIBIT_STATE,
+						 GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied for command ALTER SYSTEM"),
+				 errhint("Grant EXECUTE on pg_alter_wal_prohibit_state() to this user.")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Execute function to alter wal prohibit state */
+	(void) OidFunctionCall1(F_PG_ALTER_WAL_PROHIBIT_STATE,
+							BoolGetDatum(stmt->walprohibited));
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL-callable function to alter the system WAL prohibit state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+	uint32		cur_state_gen;
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.
+	 */
+	cur_state_gen = SetWALProhibitState(walprohibited, false);
+
+	/* Server is already in requested state */
+	if (!cur_state_gen)
+		PG_RETURN_VOID();
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange(cur_state_gen);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Request the checkpointer to complete the requested WAL prohibit state change.
+ */
+static void
+RequestWALProhibitChange(uint32 cur_state_gen)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitStateGen() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(cur_state_gen);
+		return;
+	}
+
+	/* Signal checkpointer process */
+	SendsSignalToCheckpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once wal prohibit state generation changes */
+		if (GetWALProhibitStateGen() != cur_state_gen)
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 cur_state_gen)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+	wal_prohibited =
+		(WALPROHIBIT_NEXT_STATE(cur_state_gen) == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Set the final state */
+	(void) SetWALProhibitState(wal_prohibited, true);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (LastCheckPointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetLastCheckPointSkipped(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * GetWALProhibitStateGen()
+ *
+ * Atomically return the current server WAL prohibited state generation.
+ */
+uint32
+GetWALProhibitStateGen(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->shared_state_generation);
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Increment the current shared WAL prohibit state generation according to the
+ * requested state, and return the new generation.
+ *
+ * For a transition-state request (is_final_state is false): if the desired
+ * transition state is the same as the current state -- i.e. it has already
+ * been requested by some other backend and is being processed -- the current
+ * WAL prohibit generation is returned so that this backend can wait until the
+ * shared generation changes to the final state.  If the server has already
+ * completely moved to the requested state, the requesting backend doesn't
+ * need to wait; in that case, 0 is returned.
+ *
+ * The final state can only be requested by the checkpointer or by a
+ * single-user backend, so there is no chance that the server is already in
+ * the desired final state.
+ */
+uint32
+SetWALProhibitState(bool wal_prohibited, bool is_final_state)
+{
+	uint32		new_state;
+	uint32		cur_state;
+	uint32		cur_state_gen;
+	uint32		next_state_gen;
+
+	/* Get the current state */
+	cur_state_gen = GetWALProhibitStateGen();
+	cur_state = WALPROHIBIT_CURRENT_STATE(cur_state_gen);
+
+	/* Compute new state */
+	if (is_final_state)
+	{
+		/*
+		 * Only checkpointer or single-user can set the final wal prohibit
+		 * state.
+		 */
+		Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+		/*
+		 * No other process can be setting the final state, so the next
+		 * state must be the desired one.
+		 */
+		Assert(WALPROHIBIT_NEXT_STATE(cur_state) == new_state);
+	}
+	else
+	{
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_GOING_READ_ONLY :
+			WALPROHIBIT_STATE_GOING_READ_WRITE;
+
+		/* Server is already in the requested transition state */
+		if (cur_state == new_state)
+			return cur_state_gen;	/* Wait for state transition completion */
+
+		/* Server is already in requested state */
+		if (WALPROHIBIT_NEXT_STATE(new_state) == cur_state)
+			return 0;		/* No wait is needed */
+
+		/* Prevent a concurrent, contrary in-progress state transition */
+		if (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+		}
+	}
+
+	/*
+	 * Update the new state generation in shared memory only if the generation
+	 * hasn't changed since we read it above.
+	 */
+	next_state_gen = cur_state_gen + 1;
+	(void) pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+										  &cur_state_gen, next_state_gen);
+
+	/* To be sure that any later reads of memory happen strictly after this. */
+	pg_memory_barrier();
+
+	return next_state_gen;
+}
+
+/*
+ * WALProhibitStateGenerationInit()
+ *
+ * Initialization of shared wal prohibit state generation.
+ */
+void
+WALProhibitStateGenerationInit(bool wal_prohibited)
+{
+	uint32	new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibitState->shared_state_generation, new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb13..188c299bed9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 79a77ebbfe2..ef373dc4d6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -246,9 +247,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -723,6 +725,11 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastCheckPointSkipped indicates whether the last checkpoint was skipped.
+	 */
+	bool		lastCheckPointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -968,6 +975,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -6196,6 +6204,32 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or clear the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetLastCheckPointSkipped(bool ChkptSkip)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->lastCheckPointSkipped = ChkptSkip;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Return value of lastCheckPointSkipped flag.
+ */
+bool
+LastCheckPointIsSkipped(void)
+{
+	bool	ChkptSkipped;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ChkptSkipped = XLogCtl->lastCheckPointSkipped;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return ChkptSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7708,6 +7742,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which determines whether further WAL insertion is allowed.
+	 */
+	WALProhibitStateGenerationInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7718,7 +7758,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetLastCheckPointSkipped(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7964,6 +8014,28 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	uint32 		cur_state = WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8179,9 +8251,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8200,9 +8272,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8224,6 +8307,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8513,9 +8602,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * During recovery, perform a restartpoint; otherwise, perform the shutdown
+	 * checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8528,6 +8621,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ed4f3f142d8..79da249dd5c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1485,6 +1485,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115f..efee35cbc94 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -659,6 +659,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only, just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b61..1d9c46de20a 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e7dcd4f764..e2ff484d367 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -342,6 +343,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state_gen;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -352,6 +354,30 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state_gen = GetWALProhibitStateGen();
+
+		if (wal_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state_gen);
+			continue;
+		}
+		else if (WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+				 WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example a
+			 * backend might later on request us to put the system back into
+			 * the read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1333,3 +1359,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendsSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+void
+SendsSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836a..95a738d7f25 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4060,6 +4060,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd6..2d000ec2ff7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 63cb70bcaa4..4f1b67f9d04 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -96,7 +97,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -510,9 +510,9 @@ ProcessProcSignalBarrier(void)
 			 * unconditionally, but it's more efficient to call only the ones
 			 * that might need us to do something based on the flags.
 			 */
-			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
-				&& ProcessBarrierPlaceholder())
-				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);
+			if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_WALPROHIBIT)
+				&& ProcessBarrierWALProhibit())
+				BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_WALPROHIBIT);
 		}
 		PG_CATCH();
 		{
@@ -554,24 +554,6 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 74c2162cd59..05eac206182 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3691,16 +3691,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b842..24113249f67 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12041,4 +12055,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..61836d61844
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,94 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void CompleteWALProhibitChange(uint32 wal_state);
+extern uint32 GetWALProhibitStateGen(void);
+extern uint32 SetWALProhibitState(bool wal_prohibited, bool is_final_state);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateGenerationInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+
+/*
+ * The WAL Prohibit States.
+ *
+ * 	Odd numbers represent transition states, whereas even numbers represent
+ * 	final states.  These states can be distinguished by checking the 0th bit,
+ * 	aka the transition bit.
+ */
+#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
+#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
+#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
+#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3
+
+/* The transition bit to distinguish states.  */
+#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)
+
+/* Extract last two bits */
+#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
+	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
+	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to error out explicitly before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..2bcd37894f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetLastCheckPointSkipped(bool ChkptSkip);
+extern bool LastCheckPointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Is WAL insertion prohibited (i.e., is the system read only)? */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f48f5fb4d99..41b8fe02b3e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10964,6 +10964,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4142', descr => 'alter system read only state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4b..f9ff2360b35 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -956,6 +956,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..ad5e3ba5724 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void SendsSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index db6f86e02ab..b89f8f027b1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2671,6 +2671,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.22.0
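
As a side note on the state encoding defined in walprohibit.h above: the sketch
below is illustrative only and is not part of the patch series. It copies the
macros verbatim from the proposed header so it compiles standalone, and shows
that the transition bit (bit 0) marks the in-progress states while
WALPROHIBIT_NEXT_STATE() simply advances the two low-order bits of a state
generation value to the next state in the cycle.

/*
 * Illustrative sketch only -- not patch code.  The macros are copied from the
 * proposed walprohibit.h; uint32 stands in for PostgreSQL's uint32 typedef.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t uint32;

#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3

#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)

#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

int
main(void)
{
	/* Transition (odd) states have the 0th bit set; final (even) states don't. */
	assert(WALPROHIBIT_STATE_GOING_READ_ONLY & WALPROHIBIT_TRANSITION_IN_PROGRESS);
	assert(!(WALPROHIBIT_STATE_READ_ONLY & WALPROHIBIT_TRANSITION_IN_PROGRESS));

	/* The low two bits of a state generation give the current state ... */
	assert(WALPROHIBIT_CURRENT_STATE(5) == WALPROHIBIT_STATE_GOING_READ_ONLY);
	/* ... and the next state after GOING_READ_ONLY is the final READ_ONLY. */
	assert(WALPROHIBIT_NEXT_STATE(5) == WALPROHIBIT_STATE_READ_ONLY);

	printf("WAL prohibit state encoding checks passed\n");
	return 0;
}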

v9-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch (application/octet-stream)
From 039ff18750b69097941d41a989819d1725bc2d90 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v9 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, add an Assert or an ERROR when the system is
WAL-prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common static
   inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the assertion also verifies XID validity.  For that, added
   AssertWALPermittedHaveXID().

To enforce the rule that one of these checks precedes any critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when called
inside a critical section.

If the WAL insert is not done inside a critical section, the above check is
unnecessary; we can rely on XLogBeginInsert() to perform the check and report
an error.
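
To illustrate the rule (this sketch is not part of the patch; the hunks below
apply the same shape throughout the tree), a hypothetical routine that dirties
one buffer, WAL-logs it as a full-page image, and can be reached without an
assigned XID would look roughly like this -- overwrite_page_example and
needwal are invented names for the example:

#include "postgres.h"

#include "access/walprohibit.h"		/* CheckWALPermitted(), added by this series */
#include "access/xloginsert.h"		/* log_newpage_buffer() */
#include "miscadmin.h"				/* START_/END_CRIT_SECTION() */
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
overwrite_page_example(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Reachable without an assigned XID, so raise the ERROR while it is
	 * still safe to do so, i.e. before the critical section.  Paths that
	 * always hold an XID would use AssertWALPermittedHaveXID() instead.
	 */
	if (needwal)
		CheckWALPermitted();

	/* No ereport(ERROR) from here until the change is logged. */
	START_CRIT_SECTION();

	MarkBufferDirty(buf);

	if (needwal)
		log_newpage_buffer(buf, true);	/* would trip the new assertion in
										 * XLogBeginInsert() had the check
										 * above been forgotten */

	END_CRIT_SECTION();
}

In non-assert builds the Assert variants compile away entirely, so the rule
adds a runtime branch only on the paths that genuinely need the ERROR.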
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 +++++--
 src/backend/access/heap/vacuumlazy.c      | 18 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 +++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++++++----
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 27 ++++++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++--
 src/backend/commands/sequence.c           | 16 ++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++
 40 files changed, 463 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index eb96b4bb36d..53d8c9cea28 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 82788a5c367..f31590dcd75 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e53..d91ca2b391c 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1585861a021..4d6052224fa 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2172,6 +2175,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2690,6 +2695,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3442,6 +3449,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3615,6 +3624,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4548,6 +4559,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5339,6 +5352,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5497,6 +5512,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5605,6 +5622,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5721,6 +5740,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5751,6 +5771,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5761,7 +5785,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index bc510e2e9b3..9dcae7d2153 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL when it's not allowed, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..1869df5f03f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1482,6 +1487,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1499,7 +1507,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1932,6 +1940,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1939,6 +1948,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1964,7 +1976,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibited state,
+	 * so only assert (rather than check) WAL permission when we reach here in
+	 * the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c87..2c3d8aaecbd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1898,13 +1901,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7f392480ac0..8c3fc251a29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1832,6 +1849,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1920,6 +1938,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1971,7 +1993,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2064,6 +2086,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2277,6 +2300,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2356,7 +2383,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a2ce617c8ce..2a42d784f7a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1143,6 +1144,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2927,7 +2930,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 79400604431..68ffea44e0d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 00c8894d806..f2b5bf2871a 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -24,6 +24,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 188c299bed9..abda095e735 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only get here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ef373dc4d6b..f83baef244e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1025,7 +1025,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2860,9 +2860,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not block WAL flushing; otherwise, dirty buffers
+	 * could never be evicted, as that requires flushing WAL up to their LSNs.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8821,6 +8823,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8850,6 +8854,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9078,6 +9084,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9235,6 +9243,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9893,7 +9903,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9907,10 +9917,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9932,8 +9942,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* WAL permission must have been checked by now */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1f0e4e01e69..710806143d4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 632b34af610..b01ad5a966a 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e2ff484d367..a12e026ae02 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -942,6 +942,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* Checkpoints are allowed during recovery, but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d309..5bd898833f5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3631,13 +3631,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.22.0

#67Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#64)
2 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't mind a loop, but that one looks broken. We have to clear the
bit before we call the function that processes that type of barrier.
Otherwise, if we succeed in absorbing the barrier but a new instance
of the same barrier arrives meanwhile, we'll fail to realize that we
need to absorb the new one.
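
To restate that ordering concretely, the absorption loop in the 0001 patch
attached below boils down to something like the following condensed sketch
(the per-type switch, the PG_TRY block, and the final generation bump are
omitted):

    uint32      flags;

    /*
     * Clear the shared bits up front, before any processing, so that a
     * barrier arriving while we work re-sets its bit rather than being lost.
     */
    flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);

    while (flags != 0)
    {
        ProcSignalBarrierType type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
        bool        processed = ProcessBarrierPlaceholder();

        /* Always clear the local bit, so the loop is guaranteed to end. */
        BARRIER_CLEAR_BIT(flags, type);

        /* If absorption failed, put the shared bit back and retry later. */
        if (!processed)
            ResetProcSignalBarrierBits(((uint32) 1) << type);
    }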

Here's a new version of the patch for allowing errors in
barrier-handling functions and/or rejection of barriers by those
functions. I think this responds to all of the previous review
comments from Andres. Also, here is an 0002 which is a handy bit of
test code that I wrote. It's not for commit, but it is useful for
finding bugs.

In addition to improving 0001 based on the review comments, I also
tried to write a better commit message for it, but it might still be
possible to do better there. It's a bit hard to explain the idea in
the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
with an XID -- and possibly a bunch of sub-XIDs, and possibly while
idle-in-transaction -- can elect to FATAL rather than absorbing the
barrier. I suspect for other barrier types we might have certain
(hopefully short) stretches of code where a barrier of a particular
type can't be absorbed because we're in the middle of doing something
that relies on the previous value of whatever state is protected by
the barrier. Holding off interrupts in those stretches of code would
prevent the barrier from being absorbed, but would also prevent query
cancel, backend termination, and absorption of other barrier types, so
it seems possible that just allowing the barrier-absorption function
for a barrier of that type to refuse the barrier until after the
backend exits the critical section of code will work out better.
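
Purely for illustration, a refusing barrier-absorption function might look
like the sketch below. ProcessBarrierSomethingElse, in_sensitive_stretch, and
AbsorbSomethingElseState are made-up names; the only real contract is the
bool return value that 0001 introduces.

    static bool in_sensitive_stretch = false;   /* set around the protected code */

    static bool
    ProcessBarrierSomethingElse(void)
    {
        /*
         * If we're in the middle of code that still depends on the old
         * state, refuse the barrier.  ProcessProcSignalBarrier() re-sets
         * our shared bit, and we'll be retried at a later
         * CHECK_FOR_INTERRUPTS().
         */
        if (in_sensitive_stretch)
            return false;

        /* Otherwise, adopt the new state and report the barrier absorbed. */
        AbsorbSomethingElseState();     /* placeholder for the real change */
        return true;
    }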

Just for kicks, I tried running 'make installcheck-parallel' while
emitting placeholder barriers every 0.05 s after altering the
barrier-absorption function to always return false, just to see how
ugly that was. In round figures, it made it take 24 s vs. 21 s, so
it's actually not that bad. However, it all depends on how many times
you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
that the effect might be very non-uniform. That is, if you can get the
code to be running a tight loop that does little real work but does
CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
barrier, it will probably suck. Therefore, I'm inclined to think that
the fairly strong cautionary logic in the patch is reasonable, but
perhaps it can be better worded somehow. Thoughts welcome.

I have not rebased the remainder of the patch series over these two.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Allow-for-error-or-refusal-while-absorbing-barriers.patch (application/octet-stream)
From bc586c860138e92216ac22cf6ce8ff521ec7144a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 6 Oct 2020 15:26:43 -0400
Subject: [PATCH 1/2] Allow for error or refusal while absorbing barriers.

Previously, the per-barrier-type functions tasked with absorbing
them were expected to always succeed and never throw an error.
However, that's a bit inconvenient. Further study has revealed that
there are realistic cases where it might not be possible to absorb
a ProcSignalBarrier without terminating the transaction, or even
the whole backend. Similarly, for some barrier types, there might
be other reasons where it's not reasonably possible to absorb the
barrier at certain points in the code, so provide a way for a
per-barrier-type function to reject absorbing the barrier.

Patch by me, reviewed by Andres Freund.

Discussion: http://postgr.es/m/20200908182005.xya7wetdh3pndzim@alap3.anarazel.de
---
 src/backend/storage/ipc/procsignal.c | 128 ++++++++++++++++++++++++---
 1 file changed, 116 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index ffe67acea1..abdae58c47 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -87,12 +88,17 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static void ResetProcSignalBarrierBits(uint32 flags);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -394,6 +400,12 @@ WaitForProcSignalBarrier(uint64 generation)
 		volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
 		uint64		oldval;
 
+		/*
+		 * It's important that we check only pss_barrierGeneration here and
+		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
+		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * is updated only afterward.
+		 */
 		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		while (oldval < generation)
 		{
@@ -453,7 +465,7 @@ ProcessProcSignalBarrier(void)
 {
 	uint64		local_gen;
 	uint64		shared_gen;
-	uint32		flags;
+	volatile uint32		flags;
 
 	Assert(MyProcSignalSlot);
 
@@ -482,21 +494,95 @@ ProcessProcSignalBarrier(void)
 	 * read of the barrier generation above happens before we atomically
 	 * extract the flags, and that any subsequent state changes happen
 	 * afterward.
+	 *
+	 * NB: In order to avoid race conditions, we must zero pss_barrierCheckMask
+	 * first and only afterwards try to do barrier processing. If we did it
+	 * in the other order, someone could send us another barrier of some
+	 * type right after we called the barrier-processing function but before
+	 * we cleared the bit. We would have no way of knowing that the bit needs
+	 * to stay set in that case, so the need to call the barrier-processing
+	 * function again would just get forgotten. So instead, we tentatively
+	 * clear all the bits and then put back any for which we don't manage
+	 * to successfully absorb the barrier.
 	 */
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		bool	success = true;
+
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			while (flags != 0)
+			{
+				ProcSignalBarrierType	type;
+				bool processed = true;
+
+				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
+				switch (type)
+				{
+					case PROCSIGNAL_BARRIER_PLACEHOLDER:
+						processed = ProcessBarrierPlaceholder();
+						break;
+				}
+
+				/*
+				 * To avoid an infinite loop, we must always unset the bit
+				 * in flags.
+				 */
+				BARRIER_CLEAR_BIT(flags, type);
+
+				/*
+				 * If we failed to process the barrier, reset the shared bit
+				 * so we try again later, and set a flag so that we don't bump
+				 * our generation.
+				 */
+				if (!processed)
+				{
+					ResetProcSignalBarrierBits(((uint32) 1) << type);
+					success = false;
+				}
+			}
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, we'll need to try again later to handle
+			 * that barrier type and any others that haven't been handled yet
+			 * or weren't successfully absorbed.
+			 */
+			ResetProcSignalBarrierBits(flags);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier types were not successfully absorbed, we will have
+		 * to try again later.
+		 */
+		if (!success)
+		{
+			ResetProcSignalBarrierBits(flags);
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +594,20 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
+/*
+ * If it turns out that we couldn't absorb one or more barrier types, either
+ * because the barrier-processing functions returned false or due to an error,
+ * arrange for processing to be retried later.
+ */
 static void
+ResetProcSignalBarrierBits(uint32 flags)
+{
+	pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask, flags);
+	ProcSignalBarrierPending = true;
+	InterruptPending = true;
+}
+
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +617,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.24.3 (Apple Git-128)

0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch (application/octet-stream)
From abf7f6042585c6fc45938fd90f964789eade0d6b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2020 13:04:16 -0400
Subject: [PATCH 2/2] Test module for barriers. NOT FOR COMMIT.

---
 contrib/barrier/Makefile         | 23 ++++++++++++
 contrib/barrier/barrier--1.0.sql | 14 +++++++
 contrib/barrier/barrier.c        | 63 ++++++++++++++++++++++++++++++++
 contrib/barrier/barrier.control  |  5 +++
 4 files changed, 105 insertions(+)
 create mode 100644 contrib/barrier/Makefile
 create mode 100644 contrib/barrier/barrier--1.0.sql
 create mode 100644 contrib/barrier/barrier.c
 create mode 100644 contrib/barrier/barrier.control

diff --git a/contrib/barrier/Makefile b/contrib/barrier/Makefile
new file mode 100644
index 0000000000..71f59f6629
--- /dev/null
+++ b/contrib/barrier/Makefile
@@ -0,0 +1,23 @@
+# contrib/barrier/Makefile
+
+MODULE_big = barrier
+OBJS = \
+	$(WIN32RES) \
+	barrier.o
+
+EXTENSION = barrier
+DATA = barrier--1.0.sql
+PGFILEDESC = "barrier - barrier test code NOT FOR COMMIT"
+
+REGRESS = barrier
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/barrier
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/barrier/barrier--1.0.sql b/contrib/barrier/barrier--1.0.sql
new file mode 100644
index 0000000000..66cae976a9
--- /dev/null
+++ b/contrib/barrier/barrier--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/barrier/barrier--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION barrier" to load this file. \quit
+
+CREATE FUNCTION emit_barrier(barrier_type text, count integer default 1)
+RETURNS void
+AS 'MODULE_PATHNAME', 'emit_barrier'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION wait_barrier(barrier_type text)
+RETURNS void
+AS 'MODULE_PATHNAME', 'wait_barrier'
+LANGUAGE C STRICT;
diff --git a/contrib/barrier/barrier.c b/contrib/barrier/barrier.c
new file mode 100644
index 0000000000..a0b9843992
--- /dev/null
+++ b/contrib/barrier/barrier.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.c
+ *	  emit ProcSignalBarriers for testing purposes
+ *
+ * Copyright (c) 2016-2020, PostgreSQL Global Development Group
+ *
+ *	  contrib/barrier/barrier.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/procsignal.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(emit_barrier);
+PG_FUNCTION_INFO_V1(wait_barrier);
+
+static ProcSignalBarrierType
+get_barrier_type(text *barrier_type)
+{
+	char	   *btype = text_to_cstring(barrier_type);
+
+	if (strcmp(btype, "placeholder") == 0)
+		return PROCSIGNAL_BARRIER_PLACEHOLDER;
+
+	elog(ERROR, "unknown barrier type: \"%s\"", btype);
+}
+
+Datum
+emit_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	int32		count = PG_GETARG_INT32(1);
+	int32		i;
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+
+	for (i = 0; i < count; ++i)
+	{
+		CHECK_FOR_INTERRUPTS();
+		EmitProcSignalBarrier(t);
+	}
+
+	PG_RETURN_VOID();
+}
+
+Datum
+wait_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+	uint64		generation;
+
+	generation = EmitProcSignalBarrier(t);
+	elog(NOTICE, "waiting for barrier");
+	WaitForProcSignalBarrier(generation);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/barrier/barrier.control b/contrib/barrier/barrier.control
new file mode 100644
index 0000000000..425ffc1543
--- /dev/null
+++ b/contrib/barrier/barrier.control
@@ -0,0 +1,5 @@
+# barrier extension
+comment = 'emit ProcSignalBarrier for test purposes'
+default_version = '1.0'
+module_pathname = '$libdir/barrier'
+relocatable = true
-- 
2.24.3 (Apple Git-128)

#68Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#67)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Oct 7, 2020 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't mind a loop, but that one looks broken. We have to clear the
bit before we call the function that processes that type of barrier.
Otherwise, if we succeed in absorbing the barrier but a new instance
of the same barrier arrives meanwhile, we'll fail to realize that we
need to absorb the new one.

Here's a new version of the patch for allowing errors in
barrier-handling functions and/or rejection of barriers by those
functions. I think this responds to all of the previous review
comments from Andres. Also, here is an 0002 which is a handy bit of
test code that I wrote. It's not for commit, but it is useful for
finding bugs.

In addition to improving 0001 based on the review comments, I also
tried to write a better commit message for it, but it might still be
possible to do better there. It's a bit hard to explain the idea in
the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
with an XID -- and possibly a bunch of sub-XIDs, and possibly while
idle-in-transaction -- can elect to FATAL rather than absorbing the
barrier. I suspect for other barrier types we might have certain
(hopefully short) stretches of code where a barrier of a particular
type can't be absorbed because we're in the middle of doing something
that relies on the previous value of whatever state is protected by
the barrier. Holding off interrupts in those stretches of code would
prevent the barrier from being absorbed, but would also prevent query
cancel, backend termination, and absorption of other barrier types, so
it seems possible that just allowing the barrier-absorption function
for a barrier of that type to just refuse the barrier until after the
backend exits the critical section of code will work out better.

Just for kicks, I tried running 'make installcheck-parallel' while
emitting placeholder barriers every 0.05 s after altering the
barrier-absorption function to always return false, just to see how
ugly that was. In round figures, it made it take 24 s vs. 21 s, so
it's actually not that bad. However, it all depends on how many times
you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
that the effect might be very non-uniform. That is, if you can get the
code to be running a tight loop that does little real work but does
CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
barrier, it will probably suck. Therefore, I'm inclined to think that
the fairly strong cautionary logic in the patch is reasonable, but
perhaps it can be better worded somehow. Thoughts welcome.

I have not rebased the remainder of the patch series over these two.

That I'll do.

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop. In the case where it throws
an error, the rest of the flags get reset in the catch block. Correct me if I am
missing something here.

Regards,
Amul

#69Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#68)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Oct 8, 2020 at 3:52 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Oct 7, 2020 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't mind a loop, but that one looks broken. We have to clear the
bit before we call the function that processes that type of barrier.
Otherwise, if we succeed in absorbing the barrier but a new instance
of the same barrier arrives meanwhile, we'll fail to realize that we
need to absorb the new one.

Here's a new version of the patch for allowing errors in
barrier-handling functions and/or rejection of barriers by those
functions. I think this responds to all of the previous review
comments from Andres. Also, here is an 0002 which is a handy bit of
test code that I wrote. It's not for commit, but it is useful for
finding bugs.

In addition to improving 0001 based on the review comments, I also
tried to write a better commit message for it, but it might still be
possible to do better there. It's a bit hard to explain the idea in
the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
with an XID -- and possibly a bunch of sub-XIDs, and possibly while
idle-in-transaction -- can elect to FATAL rather than absorbing the
barrier. I suspect for other barrier types we might have certain
(hopefully short) stretches of code where a barrier of a particular
type can't be absorbed because we're in the middle of doing something
that relies on the previous value of whatever state is protected by
the barrier. Holding off interrupts in those stretches of code would
prevent the barrier from being absorbed, but would also prevent query
cancel, backend termination, and absorption of other barrier types, so
it seems possible that just allowing the barrier-absorption function
for a barrier of that type to just refuse the barrier until after the
backend exits the critical section of code will work out better.

Just for kicks, I tried running 'make installcheck-parallel' while
emitting placeholder barriers every 0.05 s after altering the
barrier-absorption function to always return false, just to see how
ugly that was. In round figures, it made it take 24 s vs. 21 s, so
it's actually not that bad. However, it all depends on how many times
you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
that the effect might be very non-uniform. That is, if you can get the
code to be running a tight loop that does little real work but does
CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
barrier, it will probably suck. Therefore, I'm inclined to think that
the fairly strong cautionary logic in the patch is reasonable, but
perhaps it can be better worded somehow. Thoughts welcome.

I have not rebased the remainder of the patch series over these two.

That I'll do.

Attaching a rebased version, which includes Robert's patches, updated for the
latest master head.

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop. The case when it has an
error, the rest of the flags get reset in the catch block. Correct me if I am
missing something here.

Robert, could you please confirm this?

Regards,
Amul

Attachments:

v10-0001-Allow-for-error-or-refusal-while-absorbing-barri.patch (application/x-patch)
From f71b0841e8403fdc7f3ccb203eccebad15eabc0f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 6 Oct 2020 15:26:43 -0400
Subject: [PATCH v10 1/6] Allow for error or refusal while absorbing barriers.

Previously, the per-barrier-type functions tasked with absorbing
them were expected to always succeed and never throw an error.
However, that's a bit inconvenient. Further study has revealed that
there are realistic cases where it might not be possible to absorb
a ProcSignalBarrier without terminating the transaction, or even
the whole backend. Similarly, for some barrier types, there might
be other reasons where it's not reasonably possible to absorb the
barrier at certain points in the code, so provide a way for a
per-barrier-type function to reject absorbing the barrier.

Patch by me, reviewed by Andres Freund.

Discussion: http://postgr.es/m/20200908182005.xya7wetdh3pndzim@alap3.anarazel.de
---
 src/backend/storage/ipc/procsignal.c | 128 ++++++++++++++++++++++++---
 1 file changed, 116 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index ffe67acea1c..abdae58c476 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -87,12 +88,17 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static void ResetProcSignalBarrierBits(uint32 flags);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -394,6 +400,12 @@ WaitForProcSignalBarrier(uint64 generation)
 		volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
 		uint64		oldval;
 
+		/*
+		 * It's important that we check only pss_barrierGeneration here and
+		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
+		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * is updated only afterward.
+		 */
 		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		while (oldval < generation)
 		{
@@ -453,7 +465,7 @@ ProcessProcSignalBarrier(void)
 {
 	uint64		local_gen;
 	uint64		shared_gen;
-	uint32		flags;
+	volatile uint32		flags;
 
 	Assert(MyProcSignalSlot);
 
@@ -482,21 +494,95 @@ ProcessProcSignalBarrier(void)
 	 * read of the barrier generation above happens before we atomically
 	 * extract the flags, and that any subsequent state changes happen
 	 * afterward.
+	 *
+	 * NB: In order to avoid race conditions, we must zero pss_barrierCheckMask
+	 * first and only afterwards try to do barrier processing. If we did it
+	 * in the other order, someone could send us another barrier of some
+	 * type right after we called the barrier-processing function but before
+	 * we cleared the bit. We would have no way of knowing that the bit needs
+	 * to stay set in that case, so the need to call the barrier-processing
+	 * function again would just get forgotten. So instead, we tentatively
+	 * clear all the bits and then put back any for which we don't manage
+	 * to successfully absorb the barrier.
 	 */
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		bool	success = true;
+
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			while (flags != 0)
+			{
+				ProcSignalBarrierType	type;
+				bool processed = true;
+
+				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
+				switch (type)
+				{
+					case PROCSIGNAL_BARRIER_PLACEHOLDER:
+						processed = ProcessBarrierPlaceholder();
+						break;
+				}
+
+				/*
+				 * To avoid an infinite loop, we must always unset the bit
+				 * in flags.
+				 */
+				BARRIER_CLEAR_BIT(flags, type);
+
+				/*
+				 * If we failed to process the barrier, reset the shared bit
+				 * so we try again later, and set a flag so that we don't bump
+				 * our generation.
+				 */
+				if (!processed)
+				{
+					ResetProcSignalBarrierBits(((uint32) 1) << type);
+					success = false;
+				}
+			}
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, we'll need to try again later to handle
+			 * that barrier type and any others that haven't been handled yet
+			 * or weren't successfully absorbed.
+			 */
+			ResetProcSignalBarrierBits(flags);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier types were not successfully absorbed, we will have
+		 * to try again later.
+		 */
+		if (!success)
+		{
+			ResetProcSignalBarrierBits(flags);
+			return;
+		}
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +594,20 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
+/*
+ * If it turns out that we couldn't absorb one or more barrier types, either
+ * because the barrier-processing functions returned false or due to an error,
+ * arrange for processing to be retried later.
+ */
 static void
+ResetProcSignalBarrierBits(uint32 flags)
+{
+	pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask, flags);
+	ProcSignalBarrierPending = true;
+	InterruptPending = true;
+}
+
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +617,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.18.0

v10-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 9f19c93f11de7111b9c85b3290414062f8f6e8ca Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v10 5/6] Error or Assert before START_CRIT_SECTION for WAL
 write

The Assert or the ERROR is added based on the following criteria, which apply
when the system is in the WAL prohibited state:

 - An ERROR is raised in functions that can be reached without a valid XID, as
   in the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common
   static inline function CheckWALPermitted() is added.
 - An Assert is placed in functions that cannot be reached without a valid XID,
   to verify that the XID is indeed valid.  For that,
   AssertWALPermitted_HaveXID() is added.

To enforce the rule that one of these checks must precede a critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when called
inside a critical section.

If the WAL insert is not done inside a critical section, the check above is not
necessary; we can rely on XLogBeginInsert() to perform the check and report an
error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 +++++--
 src/backend/access/heap/vacuumlazy.c      | 18 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 +++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++++++----
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 27 ++++++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++--
 src/backend/commands/sequence.c           | 16 ++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++
 40 files changed, 463 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index eb96b4bb36d..53d8c9cea28 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 82788a5c367..f31590dcd75 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e53..d91ca2b391c 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1585861a021..4d6052224fa 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2172,6 +2175,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2690,6 +2695,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3442,6 +3449,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3615,6 +3624,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4548,6 +4559,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5339,6 +5352,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5497,6 +5512,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5605,6 +5622,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5721,6 +5740,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5751,6 +5771,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5761,7 +5785,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index bc510e2e9b3..9dcae7d2153 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..1869df5f03f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1482,6 +1487,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1499,7 +1507,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1932,6 +1940,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1939,6 +1948,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1964,7 +1976,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never runs in WAL prohibit state, so
+	 * skip the permission check when we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17e..b519a1268e8 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -271,6 +272,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c87..2c3d8aaecbd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1246,6 +1247,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1898,13 +1901,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7f392480ac0..8c3fc251a29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1235,7 +1250,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1302,6 +1317,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1832,6 +1849,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1920,6 +1938,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1971,7 +1993,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2064,6 +2086,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2277,6 +2300,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2356,7 +2383,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 43653fe5721..95cb3a94cf1 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 79400604431..68ffea44e0d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* Do not assign a transaction ID while the system is read only */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 00c8894d806..f2b5bf2871a 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -24,6 +24,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag used to enforce the rule that WAL insert permission has been
+ * checked before starting a critical section that writes WAL.  One of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
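/*
 * Minimal sketch (not part of the patch; example_logged_change is a
 * hypothetical function) of the coding rule that walpermit_checked_state
 * enforces: a critical section that will insert WAL must be preceded by
 * CheckWALPermitted(), or by one of the Assert variants when an ERROR is
 * impossible, so that a WAL-prohibited failure is reported before entering
 * the critical section instead of escalating to PANIC inside it.
 */
static void
example_logged_change(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Check WAL permission outside the critical section */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... scribble on the page here ... */
	MarkBufferDirty(buf);

	if (needwal)
		log_newpage_buffer(buf, true);	/* page has a standard layout */

	END_CRIT_SECTION();
}
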
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 188c299bed9..abda095e735 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8c51c554dc0..5d0faee6989 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1026,7 +1026,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2862,9 +2862,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not block WAL flushing; otherwise, a dirty buffer
+	 * could never be evicted, since WAL must first be flushed up to its LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8823,6 +8825,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8852,6 +8856,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9080,6 +9086,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9237,6 +9245,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9895,7 +9905,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9909,10 +9919,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9934,8 +9944,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* We are assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1f0e4e01e69..710806143d4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 632b34af610..b01ad5a966a 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bd66705a9af..da4b8d502ad 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -945,6 +945,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery, but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2fa0b065a28..38bce5c5ec7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3633,13 +3633,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, marked checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v10-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch
From 9e38853c92c49f2f45f05c8730e3de1dc2c0e80a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2020 13:04:16 -0400
Subject: [PATCH v10 2/6] Test module for barriers. NOT FOR COMMIT.

---
 contrib/barrier/Makefile         | 23 ++++++++++++
 contrib/barrier/barrier--1.0.sql | 14 +++++++
 contrib/barrier/barrier.c        | 63 ++++++++++++++++++++++++++++++++
 contrib/barrier/barrier.control  |  5 +++
 4 files changed, 105 insertions(+)
 create mode 100644 contrib/barrier/Makefile
 create mode 100644 contrib/barrier/barrier--1.0.sql
 create mode 100644 contrib/barrier/barrier.c
 create mode 100644 contrib/barrier/barrier.control

diff --git a/contrib/barrier/Makefile b/contrib/barrier/Makefile
new file mode 100644
index 00000000000..71f59f6629e
--- /dev/null
+++ b/contrib/barrier/Makefile
@@ -0,0 +1,23 @@
+# contrib/barrier/Makefile
+
+MODULE_big = barrier
+OBJS = \
+	$(WIN32RES) \
+	barrier.o
+
+EXTENSION = barrier
+DATA = barrier--1.0.sql
+PGFILEDESC = "barrier - barrier test code NOT FOR COMMIT"
+
+REGRESS = barrier
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/barrier
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/barrier/barrier--1.0.sql b/contrib/barrier/barrier--1.0.sql
new file mode 100644
index 00000000000..66cae976a96
--- /dev/null
+++ b/contrib/barrier/barrier--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/barrier/barrier--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION barrier" to load this file. \quit
+
+CREATE FUNCTION emit_barrier(barrier_type text, count integer default 1)
+RETURNS void
+AS 'MODULE_PATHNAME', 'emit_barrier'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION wait_barrier(barrier_type text)
+RETURNS void
+AS 'MODULE_PATHNAME', 'wait_barrier'
+LANGUAGE C STRICT;
diff --git a/contrib/barrier/barrier.c b/contrib/barrier/barrier.c
new file mode 100644
index 00000000000..a0b98439924
--- /dev/null
+++ b/contrib/barrier/barrier.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.c
+ *	  emit ProcSignalBarriers for testing purposes
+ *
+ * Copyright (c) 2016-2020, PostgreSQL Global Development Group
+ *
+ *	  contrib/barrier/barrier.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/procsignal.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(emit_barrier);
+PG_FUNCTION_INFO_V1(wait_barrier);
+
+static ProcSignalBarrierType
+get_barrier_type(text *barrier_type)
+{
+	char	   *btype = text_to_cstring(barrier_type);
+
+	if (strcmp(btype, "placeholder") == 0)
+		return PROCSIGNAL_BARRIER_PLACEHOLDER;
+
+	elog(ERROR, "unknown barrier type: \"%s\"", btype);
+}
+
+Datum
+emit_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	int32		count = PG_GETARG_INT32(1);
+	int32		i;
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+
+	for (i = 0; i < count; ++i)
+	{
+		CHECK_FOR_INTERRUPTS();
+		EmitProcSignalBarrier(t);
+	}
+
+	PG_RETURN_VOID();
+}
+
+Datum
+wait_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+	uint64		generation;
+
+	generation = EmitProcSignalBarrier(t);
+	elog(NOTICE, "waiting for barrier");
+	WaitForProcSignalBarrier(generation);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/barrier/barrier.control b/contrib/barrier/barrier.control
new file mode 100644
index 00000000000..425ffc15432
--- /dev/null
+++ b/contrib/barrier/barrier.control
@@ -0,0 +1,5 @@
+# barrier extension
+comment = 'emit ProcSignalBarrier for test purposes'
+default_version = '1.0'
+module_pathname = '$libdir/barrier'
+relocatable = true
-- 
2.18.0

v10-0003-Add-alter-system-read-only-write-syntax.patch
From 49b1140498d17eb49c18cc385230e13664b12f52 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v10 3/6] Add alter system read only/write syntax

Note that the syntax doesn't have any implementation yet.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 21 +++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 2b4d7654cc7..e1c6b2364b4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4018,6 +4018,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(walprohibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5403,6 +5412,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e2d1b987bf4..5f5f289b8af 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1769,6 +1769,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(walprohibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3457,6 +3463,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 08a049232e0..af74a781782 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1356,6 +1356,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(walprohibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3912,6 +3921,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ab7b535caae..d5acb438f6d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2550,6 +2550,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(walprohibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2872,6 +2885,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 480d1683468..5f306456231 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10204,8 +10205,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->walprohibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9a35147b26a..74c2162cd59 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2819,6 +2826,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3683,3 +3691,16 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index b2b4f1fd4d1..46b8ccbdd6f 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1864,9 +1864,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7ddd8c011bf..7b233925692 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -411,6 +411,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 60c2f454660..340ee87f1bc 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3196,6 +3196,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		walprohibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b6acade6c67..4f524a36f80 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.18.0

v10-0004-Implement-ALTER-SYSTEM-READ-ONLY-using-global-ba.patch
From 7b05eeb593d6508715500fee2712040210815844 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v10 4/6] Implement ALTER SYSTEM READ ONLY using global
 barrier.

Implementation:

 1. When a user changes the server state to WAL-prohibited using the
    ALTER SYSTEM READ ONLY command or by calling the
    pg_alter_wal_prohibit_state(true) SQL function, the current state
    generation in shared memory is marked as in-progress and the
    checkpointer process is signaled.  The checkpointer, noticing that
    the current state generation has the WALPROHIBIT_TRANSITION_IN_PROGRESS
    flag set, emits the barrier request and then acknowledges back to the
    backend that requested the state change once the transition has been
    completed.  The final state is also recorded in the control file to
    make it persistent across system restarts.

 2. When a backend absorbs the WAL-prohibited barrier, if it is already
    in a transaction and the transaction has already been assigned an
    XID, then the backend is killed by throwing FATAL (XXX: needs more
    discussion on this).

 3. Otherwise, if that backend is running a transaction without a valid
    XID, we don't need to do anything special right now; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will
    check XLogInsertAllowed() first, which sets the local read-only
    state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher, as well as the checkpointer, will not do
    anything while in the WAL-prohibited server state until someone
    wakes it up, e.g. a user might later request to put the system back
    to read-write by executing ALTER SYSTEM READ WRITE.

 6. At shutdown in WAL-prohibited mode, we'll skip the shutdown
    checkpoint and xlog rotation.  Starting up again will perform crash
    recovery (XXX: needs some discussion on this as well), but the
    end-of-recovery checkpoint will be skipped and will be performed
    once the system is changed back to WAL-permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. To execute ALTER SYSTEM READ ONLY/WRITE, the user must have EXECUTE
    permission on the pg_alter_wal_prohibit_state() function.

 9. Add a system_is_read_only GUC to show the system state -- it will be
    true when the system is WAL-prohibited or in recovery.
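
For readers following the state machine above, here is a small standalone C
sketch (not part of the patch, purely illustrative) of how the low two bits
of the shared state generation encode the WAL prohibit state.  The macro
values mirror the ones this patch adds in src/include/access/walprohibit.h,
but the state_names array, the main() walk-through, and the use of <stdint.h>
types are additions made only for this example.

#include <stdint.h>
#include <stdio.h>

#define WALPROHIBIT_STATE_READ_WRITE        ((uint32_t) 0)	/* WAL permitted */
#define WALPROHIBIT_STATE_GOING_READ_ONLY   ((uint32_t) 1)
#define WALPROHIBIT_STATE_READ_ONLY         ((uint32_t) 2)	/* WAL prohibited */
#define WALPROHIBIT_STATE_GOING_READ_WRITE  ((uint32_t) 3)

#define WALPROHIBIT_TRANSITION_IN_PROGRESS  ((uint32_t) 1 << 0)
#define WALPROHIBIT_CURRENT_STATE(gen)      ((uint32_t) (gen) & 3)
#define WALPROHIBIT_NEXT_STATE(gen)         WALPROHIBIT_CURRENT_STATE((gen) + 1)

static const char *const state_names[] = {
	"READ WRITE", "GOING READ ONLY", "READ ONLY", "GOING READ WRITE"
};

int
main(void)
{
	uint32_t	gen;

	/*
	 * ALTER SYSTEM READ ONLY/WRITE bumps the generation once to enter the
	 * transition state; the checkpointer bumps it again to reach the final
	 * state.  The 0th bit is therefore set only while a transition is in
	 * progress.
	 */
	for (gen = 0; gen <= 4; gen++)
		printf("generation %u: %-16s (in transition: %s)\n",
			   (unsigned) gen,
			   state_names[WALPROHIBIT_CURRENT_STATE(gen)],
			   (gen & WALPROHIBIT_TRANSITION_IN_PROGRESS) ? "yes" : "no");

	/* From GOING READ ONLY, the next (final) state is READ ONLY. */
	printf("next state after generation 1: %s\n",
		   state_names[WALPROHIBIT_NEXT_STATE(1)]);
	return 0;
}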
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 390 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 116 ++++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  39 +++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/tcop/utility.c               |  15 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  94 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 22 files changed, 715 insertions(+), 68 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..00c8894d806
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,390 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Indicates the current WAL prohibit state generation; the last two bits
+	 * of this generation indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 shared_state_generation;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(uint32 cur_state_gen);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit state.
+		 */
+		Assert(WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill
+		 * the current subtransaction only. In the case of invalidations,
+		 * that might be good enough, but for XID assignment it's not,
+		 * because assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* Check permission for pg_alter_wal_prohibit_state() */
+	if (pg_proc_aclcheck(F_PG_ALTER_WAL_PROHIBIT_STATE,
+						 GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied for command ALTER SYSTEM"),
+				 errhint("Grant EXECUTE on function pg_alter_wal_prohibit_state() to this user.")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Execute function to alter wal prohibit state */
+	(void) OidFunctionCall1(F_PG_ALTER_WAL_PROHIBIT_STATE,
+							BoolGetDatum(stmt->walprohibited));
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL callable function to alter the system's WAL prohibit state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+	uint32		cur_state_gen;
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * It is not a final state since we have yet to convey this WAL prohibit
+	 * state to all backends.
+	 */
+	cur_state_gen = SetWALProhibitState(walprohibited, false);
+
+	/* Server is already in requested state */
+	if (!cur_state_gen)
+		PG_RETURN_VOID();
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange(cur_state_gen);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Request the checkpointer to complete the requested WAL prohibit state change.
+ */
+static void
+RequestWALProhibitChange(uint32 cur_state_gen)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitStateGen() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(cur_state_gen);
+		return;
+	}
+
+	/* Signal checkpointer process */
+	SendsSignalToCheckpointer(SIGINT);
+
+	/* Wait for the requested state transition to complete */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once wal prohibit state generation changes */
+		if (GetWALProhibitStateGen() != cur_state_gen)
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 cur_state_gen)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+	wal_prohibited =
+		(WALPROHIBIT_NEXT_STATE(cur_state_gen) == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Set the final state */
+	(void) SetWALProhibitState(wal_prohibited, true);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (LastCheckPointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetLastCheckPointSkipped(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * GetWALProhibitStateGen()
+ *
+ * Atomically return the current server WAL prohibited state generation.
+ */
+uint32
+GetWALProhibitStateGen(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->shared_state_generation);
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Increments the current shared WAL prohibit state generation according to
+ * the requested state and returns the new generation.
+ *
+ * For a transition-state request (is_final_state is false): if the desired
+ * transition state is the same as the current state -- meaning some other
+ * backend has already requested it and it is in progress -- the current WAL
+ * prohibit generation is returned so that this backend can wait until the
+ * shared generation changes to the final state.  If the server has already
+ * completely moved to the requested state, the requesting backend doesn't
+ * need to wait; in that case, 0 is returned.
+ *
+ * The final state can only be requested by the checkpointer or by a
+ * single-user backend, so there is no chance that the server is already in
+ * the desired final state.
+ */
+uint32
+SetWALProhibitState(bool wal_prohibited, bool is_final_state)
+{
+	uint32		new_state;
+	uint32		cur_state;
+	uint32		cur_state_gen;
+	uint32		next_state_gen;
+
+	/* Get the current state */
+	cur_state_gen = GetWALProhibitStateGen();
+	cur_state = WALPROHIBIT_CURRENT_STATE(cur_state_gen);
+
+	/* Compute new state */
+	if (is_final_state)
+	{
+		/*
+		 * Only checkpointer or single-user can set the final wal prohibit
+		 * state.
+		 */
+		Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+		/*
+		 * No other process can set the final state, so the next final state
+		 * will be the desired state.
+		 */
+		Assert(WALPROHIBIT_NEXT_STATE(cur_state) == new_state);
+	}
+	else
+	{
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_GOING_READ_ONLY :
+			WALPROHIBIT_STATE_GOING_READ_WRITE;
+
+		/* Server is already in the requested transition state */
+		if (cur_state == new_state)
+			return cur_state;		/* Wait for state transition completion */
+
+		/* Server is already in requested state */
+		if (WALPROHIBIT_NEXT_STATE(new_state) == cur_state)
+			return 0;		/* No wait is needed */
+
+		/* Prevent a concurrent, contrary transition while one is in progress */
+		if (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+		}
+	}
+
+	/*
+	 * Update the new state generation in shared memory only if the state
+	 * generation hasn't changed since we last checked it.
+	 */
+	next_state_gen = cur_state_gen + 1;
+	(void) pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+										  &cur_state_gen, next_state_gen);
+
+	/* To be sure that any later reads of memory happen strictly after this. */
+	pg_memory_barrier();
+
+	return next_state_gen;
+}
+
+/*
+ * WALProhibitStateGenerationInit()
+ *
+ * Initialization of shared wal prohibit state generation.
+ */
+void
+WALProhibitStateGenerationInit(bool wal_prohibited)
+{
+	uint32	new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibitState->shared_state_generation, new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb13..188c299bed9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b11701..8c51c554dc0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -246,9 +247,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -723,6 +725,11 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastCheckPointSkipped indicates whether the last checkpoint was skipped.
+	 */
+	bool		lastCheckPointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -969,6 +976,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -6198,6 +6206,32 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or unset the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetLastCheckPointSkipped(bool ChkptSkip)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->lastCheckPointSkipped = ChkptSkip;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Return value of lastCheckPointSkipped flag.
+ */
+bool
+LastCheckPointIsSkipped(void)
+{
+	bool	ChkptSkipped;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ChkptSkipped = XLogCtl->lastCheckPointSkipped;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return ChkptSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7710,6 +7744,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateGenerationInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7720,7 +7760,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetLastCheckPointSkipped(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7966,6 +8016,28 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	uint32 		cur_state = WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8181,9 +8253,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8202,9 +8274,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8226,6 +8309,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8515,9 +8604,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint if in recovery; otherwise, perform the shutdown
+	 * checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8530,6 +8623,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c6dd084fbcc..c6c2b3b6332 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1510,6 +1510,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115f..efee35cbc94 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -659,6 +659,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b61..1d9c46de20a 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef4..bd66705a9af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -342,6 +343,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state_gen;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -352,6 +354,30 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state_gen = GetWALProhibitStateGen();
+
+		if (wal_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state_gen);
+			continue;
+		}
+		else if (WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+				 WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later request us to put the system back into
+			 * the read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1336,3 +1362,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendsSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+void
+SendsSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 822f0ebc628..f020ff7e5a0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4203,6 +4203,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd6..2d000ec2ff7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index abdae58c476..ab5b5c888fb 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -607,24 +607,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 74c2162cd59..05eac206182 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3691,16 +3691,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa47..9c6d89627eb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12069,4 +12083,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The return string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..61836d61844
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,94 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void CompleteWALProhibitChange(uint32 wal_state);
+extern uint32 GetWALProhibitStateGen(void);
+extern uint32 SetWALProhibitState(bool wal_prohibited, bool is_final_state);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateGenerationInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+
+/*
+ * The WAL Prohibit States.
+ *
+ * 	An odd number represents a transition state, whereas an even number
+ * 	represents a final state.  These states can be distinguished by checking
+ * 	the 0th bit, aka the transition bit.
+ */
+#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
+#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
+#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
+#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3
+
+/* The transition bit to distinguish states.  */
+#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)
+
+/* Extract last two bits */
+#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
+	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
+	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..2bcd37894f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetLastCheckPointSkipped(bool ChkptSkip);
+extern bool LastCheckPointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL insertion is prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a66870bcc08..5759594e157 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10991,6 +10991,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4142', descr => 'alter system read only state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a821ff4f158..b226592db90 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1021,6 +1021,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..ad5e3ba5724 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void SendsSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4f524a36f80..aa8f61b3c98 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2679,6 +2679,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
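
The walprohibit.h hunk above encodes the whole read-only/read-write state
machine in a single monotonically increasing generation counter: the low two
bits name the state, and the lowest bit marks a transition in progress. The
standalone sketch below illustrates just that encoding; the four state macros
and the two extractor macros are copied from the patch (with the macro
parameter shortened), while the typedef, state_name() helper, and main()
driver are invented here purely for illustration and are not part of the
patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t uint32;

#define WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
#define WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
#define WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
#define WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3

#define WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)

/* Extract the low two bits; the next state is simply generation + 1. */
#define WALPROHIBIT_CURRENT_STATE(g)	((uint32) (g) & ((uint32) ((1 << 2) - 1)))
#define WALPROHIBIT_NEXT_STATE(g)		WALPROHIBIT_CURRENT_STATE((g) + 1)

static const char *
state_name(uint32 state)
{
	switch (state)
	{
		case WALPROHIBIT_STATE_READ_WRITE:
			return "READ_WRITE";
		case WALPROHIBIT_STATE_GOING_READ_ONLY:
			return "GOING_READ_ONLY";
		case WALPROHIBIT_STATE_READ_ONLY:
			return "READ_ONLY";
		default:
			return "GOING_READ_WRITE";
	}
}

int
main(void)
{
	uint32		gen;

	/* Each generation bump advances the state machine by one step. */
	for (gen = 0; gen < 8; gen++)
	{
		uint32		state = WALPROHIBIT_CURRENT_STATE(gen);
		bool		in_transition =
			(state & WALPROHIBIT_TRANSITION_IN_PROGRESS) != 0;

		printf("gen=%u  state=%-16s  transition=%s  next=%s\n",
			   (unsigned) gen, state_name(state),
			   in_transition ? "yes" : "no",
			   state_name(WALPROHIBIT_NEXT_STATE(gen)));
	}
	return 0;
}

Successive generations therefore cycle READ_WRITE, GOING_READ_ONLY, READ_ONLY,
GOING_READ_WRITE and back, which is why the patch only ever bumps the shared
generation by one to move the state machine forward.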

v10-0006-WIP-Documentation.patchapplication/x-patch; name=v10-0006-WIP-Documentation.patchDownload
From 2badd4eaa98f517eaa4fed5456cb39fad3c07ae2 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v10 6/6] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into the WAL prohibited state by ALTER SYSTEM READ ONLY.
+We have a lower-level defense in XLogBeginInsert() and elsewhere to stop us
+from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error, because that would escalate to a PANIC as mentioned previously.
+
+During recovery we never reach the point where we try to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt must
+stop writing WAL immediately.  While absorbing the barrier, a backend kills its
+running transaction if that transaction has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning, a WAL write.
+Transactions that have not acquired an XID, and operations such as VACUUM or
+CREATE INDEX CONCURRENTLY that do not necessarily need an XID to write WAL, are
+not stopped by barrier processing; they may instead hit an error from
+XLogBeginInsert() when they try to write WAL in the read-only system state.  To
+prevent such an error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that will write WAL, we have added an assert-only flag indicating that
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is not inside a critical section, since throwing
+an error is acceptable there.  To set the permission-checked flag, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+on exit from the critical section.  The rules for choosing among the permission
+check routines are:
+
+	Places where a WAL write in a critical section can be expected without a
+	valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places that perform INSERT or UPDATE, which never happen without a valid
+	XID, can be checked using AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that the permission check happened, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while in read only (i.e. during
+recovery or in WAL prohibit state), so we simply skip dirtying blocks because of
+hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
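
The README additions above boil down to one calling pattern: check WAL write
permission while raising an ERROR is still safe, and only then enter the
critical section that calls XLogBeginInsert(). The sketch below shows that
ordering as a self-contained program; every identifier in it
(check_wal_permitted, xlog_begin_insert, the two flags) is a stub invented
here to stand in for the real routines named in the patch, so only the
sequence of steps is meaningful.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for shared state and the assert-only "permission checked" flag. */
static bool wal_prohibited = false;
static bool walpermit_checked = false;

/* Stub for CheckWALPermitted(): error out while it is still safe to do so. */
static void
check_wal_permitted(void)
{
	if (wal_prohibited)
	{
		fprintf(stderr, "ERROR:  system is now read only\n");
		exit(1);				/* the real code raises ereport(ERROR) here */
	}
	walpermit_checked = true;
}

/* Stub for XLogBeginInsert(): the patch asserts the check already happened. */
static void
xlog_begin_insert(void)
{
	if (!walpermit_checked)
		abort();				/* models the assertion failure */
}

int
main(void)
{
	/* 1. Check WAL permission before the critical section. */
	check_wal_permitted();

	/* 2. START_CRIT_SECTION(): any error from here on would become a PANIC. */

	/* 3. Modify shared buffers, then build and insert the WAL record. */
	xlog_begin_insert();

	/* 4. END_CRIT_SECTION(): the checked flag is reset on exit. */
	walpermit_checked = false;

	printf("WAL record inserted while WAL writing was permitted\n");
	return 0;
}

Flipping wal_prohibited to true makes the sketch stop at step 1 with an ERROR
instead of reaching the would-be critical section, which is exactly the
guarantee the new coding rule is meant to provide.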

#70Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#68)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop. The case when it has an
error, the rest of the flags get reset in the catch block. Correct me if I am
missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Andres, do you like the new loop better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
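
For readers following along, here is a toy model of the flag handling
discussed above; the names are invented stand-ins and this is not the real
ProcessProcSignalBarrier() code. Each bit of flags represents one pending
barrier type. A handler that cannot absorb its barrier right now gets its bit
re-armed immediately inside the loop (and, per the message above, an error
re-arms the remaining bits in the catch block), so a second reset after the
loop would have nothing left to do.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t pending_retry;	/* stand-in for the re-armed barrier bits */

/* Stand-in for ResetProcSignalBarrierBits(): remember bits for a later retry. */
static void
reset_barrier_bits(uint32_t bits)
{
	pending_retry |= bits;
}

/* Stand-in for a ProcessBarrierXXX() handler; pretend bit 1 can't absorb yet. */
static bool
process_barrier(uint32_t bit)
{
	return bit != (1u << 1);
}

int
main(void)
{
	uint32_t	flags = (1u << 0) | (1u << 1) | (1u << 2);

	while (flags != 0)
	{
		uint32_t	bit = flags & (~flags + 1);		/* rightmost set bit */

		flags &= ~bit;
		if (!process_barrier(bit))
			reset_barrier_bits(bit);	/* already re-armed inside the loop */
	}

	/* Only the unabsorbed barrier remains; prints 0x2. */
	printf("left for retry: 0x%x\n", (unsigned) pending_retry);
	return 0;
}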

#71Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#70)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop. The case when it has an
error, the rest of the flags get reset in the catch block. Correct me if I am
missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Sure, I'll update that. Thanks for the confirmation.

Andres, do you like the new loop better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#72Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#71)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Nov 20, 2020 at 11:13 PM Amul Sul <sulamul@gmail.com> wrote:

On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop. The case when it has an
error, the rest of the flags get reset in the catch block. Correct me if I am
missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Sure, I'll update that. Thanks for the confirmation.

Attached is the updated version where unnecessary ResetProcSignalBarrierBits()
call in the 0001 patch is removed. The rest of the patches are unchanged, thanks.

Andres, do you like the new loop better?

Regards,
Amul

Attachments:

v11-0004-Implement-ALTER-SYSTEM-READ-ONLY-using-global-ba.patchapplication/octet-stream; name=v11-0004-Implement-ALTER-SYSTEM-READ-ONLY-using-global-ba.patchDownload
From c46845a9b01087808d13d9013915d6f301a7699b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v11 4/6] Implement ALTER SYSTEM READ ONLY using global
 barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited using the
    ALTER SYSTEM READ ONLY command or by calling the
    pg_alter_wal_prohibit_state(true) SQL function, the current state
    generation in shared memory is marked as in-progress and the
    checkpointer process is signaled.  The checkpointer, noticing that the
    current state generation has the WALPROHIBIT_TRANSITION_IN_PROGRESS
    flag set, emits the barrier request and then acknowledges back to the
    backend that requested the state change once the transition has been
    completed.  The final state is updated in the control file to make it
    persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier while it is already
    in a transaction that has an assigned XID, the backend is killed by
    throwing FATAL (XXX: need more discussion on this).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher, as well as the checkpointer, will not do
    anything while in the WAL-Prohibited server state until someone wakes
    it up.  E.g. a user might later request putting the system back to
    read-write by executing ALTER SYSTEM READ WRITE.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    need some discussion on this as well), but the end-of-recovery
    checkpoint will be skipped and performed later, when the system is
    changed back to WAL-Permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. To execute ALTER SYSTEM READ ONLY/WRITE, the user must have execute
    permission on the pg_alter_wal_prohibit_state() function.

 9. Add a system_is_read_only GUC to show the system state -- it is true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 390 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 116 ++++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  39 +++
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/tcop/utility.c               |  15 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  94 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 22 files changed, 715 insertions(+), 68 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..00c8894d806
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,390 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Current WAL prohibit state generation; the last two bits of this
+	 * generation indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 shared_state_generation;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(uint32 cur_state_gen);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit state.
+		 */
+		Assert(WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need more thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply
+		 * kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill
+		 * the current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* Check permission for pg_alter_wal_prohibit_state() */
+	if (pg_proc_aclcheck(F_PG_ALTER_WAL_PROHIBIT_STATE,
+						 GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied for command ALTER SYSTEM"),
+				 errhint("Grant EXECUTE permission on pg_alter_wal_prohibit_state() to this user.")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Execute function to alter wal prohibit state */
+	(void) OidFunctionCall1(F_PG_ALTER_WAL_PROHIBIT_STATE,
+							BoolGetDatum(stmt->walprohibited));
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL callable function to alter system read write state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+	uint32		cur_state_gen;
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * It is not a final state, since we have yet to convey this WAL prohibit
+	 * state to all backends.
+	 */
+	cur_state_gen = SetWALProhibitState(walprohibited, false);
+
+	/* Server is already in requested state */
+	if (!cur_state_gen)
+		PG_RETURN_VOID();
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange(cur_state_gen);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Ask the checkpointer to perform the requested WAL prohibit state change.
+ */
+static void
+RequestWALProhibitChange(uint32 cur_state_gen)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitStateGen() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(cur_state_gen);
+		return;
+	}
+
+	/* Signal checkpointer process */
+	SendsSignalToCheckpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once wal prohibit state generation changes */
+		if (GetWALProhibitStateGen() != cur_state_gen)
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Checkpointer will call this to complete the requested WAL prohibit state
+ * transition.
+ */
+void
+CompleteWALProhibitChange(uint32 cur_state_gen)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * A WAL prohibit state change has been initiated.  We need to complete the
+	 * transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+	wal_prohibited =
+		(WALPROHIBIT_NEXT_STATE(cur_state_gen) == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Set the final state */
+	(void) SetWALProhibitState(wal_prohibited, true);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (LastCheckPointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetLastCheckPointSkipped(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * GetWALProhibitStateGen()
+ *
+ * Atomically return the current server WAL prohibited state generation.
+ */
+uint32
+GetWALProhibitStateGen(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->shared_state_generation);
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Advances the shared WAL prohibit state generation toward the requested
+ * state and returns the new generation.
+ *
+ * For a transition-state request (is_final_state is false): if the desired
+ * transition state is the same as the current state, i.e. some other backend
+ * has already requested it and it is being processed, the current WAL
+ * prohibit generation is returned so that this backend can wait until the
+ * shared generation changes to the final state.  If the server has already
+ * completely moved to the requested state, the requesting backend doesn't
+ * need to wait at all; in that case, 0 is returned.
+ *
+ * The final state can only be requested by the checkpointer or in single-user
+ * mode, so there is no chance that the server is already in the desired final
+ * state.
+ */
+uint32
+SetWALProhibitState(bool wal_prohibited, bool is_final_state)
+{
+	uint32		new_state;
+	uint32		cur_state;
+	uint32		cur_state_gen;
+	uint32		next_state_gen;
+
+	/* Get the current state */
+	cur_state_gen = GetWALProhibitStateGen();
+	cur_state = WALPROHIBIT_CURRENT_STATE(cur_state_gen);
+
+	/* Compute new state */
+	if (is_final_state)
+	{
+		/*
+		 * Only checkpointer or single-user can set the final wal prohibit
+		 * Only the checkpointer or a single-user backend can set the final
+		 * WAL prohibit state.
+		Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+		/*
+		 * No other process can be setting the final state concurrently, so
+		 * the next state from the current one must be the desired state.
+		 */
+		Assert(WALPROHIBIT_NEXT_STATE(cur_state) == new_state);
+	}
+	else
+	{
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_GOING_READ_ONLY :
+			WALPROHIBIT_STATE_GOING_READ_WRITE;
+
+		/* Server is already in the requested transition state */
+		if (cur_state == new_state)
+			return cur_state;		/* Wait for state transition completion */
+
+		/* Server is already in requested state */
+		if (WALPROHIBIT_NEXT_STATE(new_state) == cur_state)
+			return 0;		/* No wait is needed */
+
+		/* Disallow the request if a contrary transition is already in progress */
+		if (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+		}
+	}
+
+	/*
+	 * Update the new state generation in shared memory only if the state
+	 * generation hasn't changed since we last checked it.
+	 */
+	next_state_gen = cur_state_gen + 1;
+	(void) pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+										  &cur_state_gen, next_state_gen);
+
+	/* To be sure that any later reads of memory happen strictly after this. */
+	pg_memory_barrier();
+
+	return next_state_gen;
+}
+
+/*
+ * WALProhibitStateGenerationInit()
+ *
+ * Initialization of shared wal prohibit state generation.
+ */
+void
+WALProhibitStateGenerationInit(bool wal_prohibited)
+{
+	uint32	new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibitState->shared_state_generation, new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7eaa..485e441225b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13f1d8c3dc7..659376bb327 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -246,9 +247,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -723,6 +725,11 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastCheckPointSkipped indicates whether the last checkpoint was skipped.
+	 */
+	bool		lastCheckPointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -969,6 +976,7 @@ static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static inline bool IsWALProhibited(void);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -6191,6 +6199,32 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or clear the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetLastCheckPointSkipped(bool ChkptSkip)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->lastCheckPointSkipped = ChkptSkip;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Return value of lastCheckPointSkipped flag.
+ */
+bool
+LastCheckPointIsSkipped(void)
+{
+	bool	ChkptSkipped;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ChkptSkipped = XLogCtl->lastCheckPointSkipped;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return ChkptSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7704,6 +7738,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize WAL prohibit state in shared
+	 * memory, which will decide whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateGenerationInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7714,7 +7754,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetLastCheckPointSkipped(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7960,6 +8010,28 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+static inline bool
+IsWALProhibited(void)
+{
+	uint32 		cur_state = WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8175,9 +8247,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8196,9 +8268,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8220,6 +8303,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8509,9 +8598,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint if in recovery; otherwise, perform the shutdown
+	 * checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8524,6 +8617,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b66..717104f0216 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1504,6 +1504,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index aa5b97fbacb..9c553804d96 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -659,6 +659,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b61..1d9c46de20a 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef4..bd66705a9af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -342,6 +343,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		uint32		wal_state_gen;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -352,6 +354,30 @@ CheckpointerMain(void)
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
 
+		wal_state_gen = GetWALProhibitStateGen();
+
+		if (wal_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			/* Complete WAL prohibit state change request */
+			CompleteWALProhibitChange(wal_state_gen);
+			continue;
+		}
+		else if (WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+				 WALPROHIBIT_STATE_READ_ONLY)
+		{
+			/*
+			 * Don't do anything until someone wakes us up.  For example, a
+			 * backend might later request us to put the system back into the
+			 * read-write state.
+			 */
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_CHECKPOINTER_MAIN);
+			continue;
+		}
+
+		Assert(WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
 		 * word in shared memory is nonzero.  We shouldn't need to acquire the
@@ -1336,3 +1362,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendsSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+void
+SendsSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b2..3fb5dba318a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4206,6 +4206,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd6..2d000ec2ff7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 9c3ebccd3e2..b98f88122b6 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -604,24 +604,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f68b11d0af1..413913457f4 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -85,7 +86,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3677,16 +3677,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e4..888015c5355 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -225,6 +225,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -615,6 +616,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2036,6 +2038,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12069,4 +12083,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..61836d61844
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,94 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void CompleteWALProhibitChange(uint32 wal_state);
+extern uint32 GetWALProhibitStateGen(void);
+extern uint32 SetWALProhibitState(bool wal_prohibited, bool is_final_state);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateGenerationInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+
+/*
+ * The WAL Prohibit States.
+ *
+ * 	An odd number represents a transition state, whereas an even number
+ * 	represents a final state.  These states can be distinguished by checking
+ * 	the 0th bit, aka the transition bit.
+ */
+#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
+#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
+#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
+#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3
+
+/* The transition bit to distinguish states.  */
+#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)
+
+/* Extract last two bits */
+#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
+	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
+	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e715..2bcd37894f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetLastCheckPointSkipped(bool ChkptSkip);
+extern bool LastCheckPointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e9..f4dc5412ee6 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Is WAL insertion currently prohibited? */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 33dacfd3403..ca9e853fe46 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11004,6 +11004,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4142', descr => 'alter system read only state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe7..3c1ad43b4b0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1027,6 +1027,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e6..ad5e3ba5724 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void SendsSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f38..bae06202b4a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 37bfa69edf0..fe735b548e5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2681,6 +2681,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
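
To make the state encoding in walprohibit.h above easier to follow, here is a
minimal standalone sketch (not part of the patch; it merely mirrors the macros
shown in the hunk) that walks a hypothetical generation counter and decodes
the low two bits into the current and next states:

#include <stdio.h>
#include <stdint.h>

/* Mirror of the patch's encoding: odd = transition state, even = final state. */
#define WALPROHIBIT_STATE_READ_WRITE        ((uint32_t) 0)   /* WAL permitted */
#define WALPROHIBIT_STATE_GOING_READ_ONLY   ((uint32_t) 1)
#define WALPROHIBIT_STATE_READ_ONLY         ((uint32_t) 2)   /* WAL prohibited */
#define WALPROHIBIT_STATE_GOING_READ_WRITE  ((uint32_t) 3)

#define WALPROHIBIT_TRANSITION_IN_PROGRESS  ((uint32_t) 1 << 0)

/* The low two bits of the generation number carry the current state. */
#define WALPROHIBIT_CURRENT_STATE(gen)  ((uint32_t) (gen) & ((uint32_t) ((1 << 2) - 1)))
#define WALPROHIBIT_NEXT_STATE(gen)     WALPROHIBIT_CURRENT_STATE((gen) + 1)

int
main(void)
{
    uint32_t    gen;

    /* Each increment of the generation moves to the next state, wrapping at 4. */
    for (gen = 0; gen < 6; gen++)
    {
        uint32_t    state = WALPROHIBIT_CURRENT_STATE(gen);
        int         in_transition = (state & WALPROHIBIT_TRANSITION_IN_PROGRESS) != 0;

        printf("generation %u: state %u (%s), next state %u\n",
               (unsigned) gen, (unsigned) state,
               in_transition ? "transition" : "final",
               (unsigned) WALPROHIBIT_NEXT_STATE(gen));
    }
    return 0;
}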

v11-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/octet-stream)
From 3e151947dc16587d59d4f2cb3272a8b22d414da1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v11 5/6] Error or Assert before START_CRIT_SECTION for WAL
 write

Added an Assert or an ERROR before WAL writes when the system is WAL
prohibited, based on the following criteria:

 - Added an ERROR for functions that can be reached without a valid XID (e.g.
   in case of VACUUM or CREATE INDEX CONCURRENTLY).  For that, added the common
   static inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies that an XID is assigned.  For that, added
   AssertWALPermittedHaveXID().

To enforce the rule that one of the aforesaid assert or error checks appears
before entering a critical section for a WAL write, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
will trip an assertion when it is called inside a critical section.

If the WAL insert is not done inside a critical section, the above check is
not necessary; we can rely on XLogBeginInsert() itself to perform the check
and report an error.
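
As a rough, self-contained illustration of this rule (this is not patch code;
the lowercase names below are toy stand-ins for the real backend routines and
state), the check routine records that it ran, and the WAL-insert entry point
asserts that fact whenever it is invoked inside a critical section:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for backend-global state used by the real routines. */
static bool wal_prohibited = false;     /* stand-in for !XLogInsertAllowed() */
static bool walpermit_checked = false;  /* stand-in for walpermit_checked_state */
static int  crit_section_count = 0;

static void
check_wal_permitted(void)
{
    if (wal_prohibited)
    {
        /* The real code raises ERRCODE_READ_ONLY_SQL_TRANSACTION here. */
        fprintf(stderr, "ERROR: system is now read only\n");
        return;
    }
    walpermit_checked = true;
}

static void
start_crit_section(void)
{
    crit_section_count++;
}

static void
xlog_begin_insert(void)
{
    /* The coding rule: inside a critical section, the permit check must have run. */
    assert(crit_section_count == 0 || walpermit_checked);
    printf("beginning WAL record assembly\n");
}

int
main(void)
{
    check_wal_permitted();      /* must precede the critical section */
    start_crit_section();
    xlog_begin_insert();        /* would assert if the check above were missing */
    return 0;
}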
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 +++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 ++++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 +++++--
 src/backend/access/heap/vacuumlazy.c      | 18 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 +++++-
 src/backend/access/nbtree/nbtpage.c       | 39 +++++++++++++++++++----
 src/backend/access/spgist/spgdoinsert.c   | 13 ++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 27 ++++++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++--
 src/backend/commands/sequence.c           | 16 ++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++
 40 files changed, 463 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index eb96b4bb36d..53d8c9cea28 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c603..47142193706 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -759,6 +760,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index-building transactions will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b5..8b377a679ab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7c..fd766da445d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 82788a5c367..f31590dcd75 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7a2690e97f2..0abc5990100 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2e41b34d8d5..b8c2a993408 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77433dc8a41..989d82ffcaf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index-building transactions will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd363..b48ea1a746a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e53..d91ca2b391c 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 25b42e38f22..4a870a062ba 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index-building transactions will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -234,6 +238,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -465,9 +470,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -500,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -526,6 +534,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -567,7 +578,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1641,6 +1652,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1659,13 +1671,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1682,7 +1697,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c7724..bbb3ebb19ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c8..f4903a43bb5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -467,6 +468,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -573,6 +575,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -603,7 +609,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -690,6 +696,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -788,6 +795,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -809,7 +819,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -883,6 +893,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -890,7 +903,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 2ebe671967b..2eab69efa91 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 00f0a940116..e7c5dd3e3ce 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494a..55a867dd375 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1b2f70499e5..bde45e2f6e6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1898,6 +1899,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2172,6 +2175,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2690,6 +2695,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3442,6 +3449,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3615,6 +3624,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4548,6 +4559,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5339,6 +5352,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5497,6 +5512,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5605,6 +5622,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5721,6 +5740,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5751,6 +5771,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5761,7 +5785,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 9e04bc712c9..9e1ab73ee6e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 25f2d5df1b8..8ba1af0be63 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1481,6 +1486,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1498,7 +1506,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b1072183bcd..44244363968 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibited state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 9e535124c46..8d3c2a72012 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -227,6 +228,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index dde43b1415a..652966e679b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -19,6 +19,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
@@ -1230,6 +1231,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1882,13 +1885,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2450,6 +2456,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 848123d9218..7940a626efe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -179,6 +180,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -202,6 +204,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -214,7 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -332,6 +338,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -377,6 +384,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -395,7 +406,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1131,6 +1142,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1145,7 +1157,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	/* XLOG stuff -- allocate and fill buffer before critical section */
-	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	if (nupdatable > 0 && needwal)
 	{
 		Size		offset = 0;
 
@@ -1175,6 +1187,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 	}
 
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1227,7 +1242,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
@@ -1294,6 +1309,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		latestRemovedXid =
 			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1831,6 +1848,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1919,6 +1937,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1970,7 +1992,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2063,6 +2085,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2276,6 +2299,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2355,7 +2382,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 934d65b89f2..3c5a15c5d32 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f97..3308832b85b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index eb8de7cf329..b2d41925fb4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9bad98..5aef62724c6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a4944faa32e..0c7a2362f25 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 00c8894d806..f2b5bf2871a 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -24,6 +24,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for the WAL writes.  For this, either of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 485e441225b..40835a89785 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 659376bb327..4dac37ab799 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1026,7 +1026,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2862,9 +2862,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8803,6 +8805,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -8832,6 +8836,8 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	AssertWALPermitted();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9060,6 +9066,8 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9217,6 +9225,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9875,7 +9885,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9889,10 +9899,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9914,8 +9924,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assert that WAL write permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1f0e4e01e69..710806143d4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 632b34af610..b01ad5a966a 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 484f7ea2c0e..acc34d2ad7c 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bd66705a9af..da4b8d502ad 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -945,6 +945,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc0..a18a22350e2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3633,13 +3633,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c2..b05b0fe5f41 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f4554..f949a290745 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 671fbb0ed5c..90d7599a57c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..f3ff120601e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,12 +94,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, it is marked as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -109,6 +134,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -135,6 +161,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v11-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch (application/octet-stream)
From 15720547e8d3a32603027cd10163eb23d6766696 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2020 13:04:16 -0400
Subject: [PATCH v11 2/6] Test module for barriers. NOT FOR COMMIT.

---
 contrib/barrier/Makefile         | 23 ++++++++++++
 contrib/barrier/barrier--1.0.sql | 14 +++++++
 contrib/barrier/barrier.c        | 63 ++++++++++++++++++++++++++++++++
 contrib/barrier/barrier.control  |  5 +++
 4 files changed, 105 insertions(+)
 create mode 100644 contrib/barrier/Makefile
 create mode 100644 contrib/barrier/barrier--1.0.sql
 create mode 100644 contrib/barrier/barrier.c
 create mode 100644 contrib/barrier/barrier.control

diff --git a/contrib/barrier/Makefile b/contrib/barrier/Makefile
new file mode 100644
index 00000000000..71f59f6629e
--- /dev/null
+++ b/contrib/barrier/Makefile
@@ -0,0 +1,23 @@
+# contrib/barrier/Makefile
+
+MODULE_big = barrier
+OBJS = \
+	$(WIN32RES) \
+	barrier.o
+
+EXTENSION = barrier
+DATA = barrier--1.0.sql
+PGFILEDESC = "barrier - barrier test code NOT FOR COMMIT"
+
+REGRESS = barrier
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/barrier
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/barrier/barrier--1.0.sql b/contrib/barrier/barrier--1.0.sql
new file mode 100644
index 00000000000..66cae976a96
--- /dev/null
+++ b/contrib/barrier/barrier--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/barrier/barrier--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION barrier" to load this file. \quit
+
+CREATE FUNCTION emit_barrier(barrier_type text, count integer default 1)
+RETURNS void
+AS 'MODULE_PATHNAME', 'emit_barrier'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION wait_barrier(barrier_type text)
+RETURNS void
+AS 'MODULE_PATHNAME', 'wait_barrier'
+LANGUAGE C STRICT;
diff --git a/contrib/barrier/barrier.c b/contrib/barrier/barrier.c
new file mode 100644
index 00000000000..a0b98439924
--- /dev/null
+++ b/contrib/barrier/barrier.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.c
+ *	  emit ProcSignalBarriers for testing purposes
+ *
+ * Copyright (c) 2016-2020, PostgreSQL Global Development Group
+ *
+ *	  contrib/barrier/barrier.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/procsignal.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(emit_barrier);
+PG_FUNCTION_INFO_V1(wait_barrier);
+
+static ProcSignalBarrierType
+get_barrier_type(text *barrier_type)
+{
+	char	   *btype = text_to_cstring(barrier_type);
+
+	if (strcmp(btype, "placeholder") == 0)
+		return PROCSIGNAL_BARRIER_PLACEHOLDER;
+
+	elog(ERROR, "unknown barrier type: \"%s\"", btype);
+}
+
+Datum
+emit_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	int32		count = PG_GETARG_INT32(1);
+	int32		i;
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+
+	for (i = 0; i < count; ++i)
+	{
+		CHECK_FOR_INTERRUPTS();
+		EmitProcSignalBarrier(t);
+	}
+
+	PG_RETURN_VOID();
+}
+
+Datum
+wait_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+	uint64		generation;
+
+	generation = EmitProcSignalBarrier(t);
+	elog(NOTICE, "waiting for barrier");
+	WaitForProcSignalBarrier(generation);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/barrier/barrier.control b/contrib/barrier/barrier.control
new file mode 100644
index 00000000000..425ffc15432
--- /dev/null
+++ b/contrib/barrier/barrier.control
@@ -0,0 +1,5 @@
+# barrier extension
+comment = 'emit ProcSignalBarrier for test purposes'
+default_version = '1.0'
+module_pathname = '$libdir/barrier'
+relocatable = true
-- 
2.18.0

v11-0001-Allow-for-error-or-refusal-while-absorbing-barri.patch (application/octet-stream)
From cf62f0034292f1070ee388cf8dd3fe95c5c1db25 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 6 Oct 2020 15:26:43 -0400
Subject: [PATCH v11 1/6] Allow for error or refusal while absorbing barriers.

Previously, the per-barrier-type functions tasked with absorbing
them were expected to always succeed and never throw an error.
However, that's a bit inconvenient. Further study has revealed that
there are realistic cases where it might not be possible to absorb
a ProcSignalBarrier without terminating the transaction, or even
the whole backend. Similarly, for some barrier types, there might
be other reasons where it's not reasonably possible to absorb the
barrier at certain points in the code, so provide a way for a
per-barrier-type function to reject absorbing the barrier.

Patch by me, reviewed by Andres Freund.

Discussion: http://postgr.es/m/20200908182005.xya7wetdh3pndzim@alap3.anarazel.de
---
 src/backend/storage/ipc/procsignal.c | 125 ++++++++++++++++++++++++---
 1 file changed, 113 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index ffe67acea1c..9c3ebccd3e2 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -87,12 +88,17 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static void ResetProcSignalBarrierBits(uint32 flags);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -394,6 +400,12 @@ WaitForProcSignalBarrier(uint64 generation)
 		volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
 		uint64		oldval;
 
+		/*
+		 * It's important that we check only pss_barrierGeneration here and
+		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
+		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * is updated only afterward.
+		 */
 		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		while (oldval < generation)
 		{
@@ -453,7 +465,7 @@ ProcessProcSignalBarrier(void)
 {
 	uint64		local_gen;
 	uint64		shared_gen;
-	uint32		flags;
+	volatile uint32		flags;
 
 	Assert(MyProcSignalSlot);
 
@@ -482,21 +494,92 @@ ProcessProcSignalBarrier(void)
 	 * read of the barrier generation above happens before we atomically
 	 * extract the flags, and that any subsequent state changes happen
 	 * afterward.
+	 *
+	 * NB: In order to avoid race conditions, we must zero pss_barrierCheckMask
+	 * first and only afterwards try to do barrier processing. If we did it
+	 * in the other order, someone could send us another barrier of some
+	 * type right after we called the barrier-processing function but before
+	 * we cleared the bit. We would have no way of knowing that the bit needs
+	 * to stay set in that case, so the need to call the barrier-processing
+	 * function again would just get forgotten. So instead, we tentatively
+	 * clear all the bits and then put back any for which we don't manage
+	 * to successfully absorb the barrier.
 	 */
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		bool	success = true;
+
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			while (flags != 0)
+			{
+				ProcSignalBarrierType	type;
+				bool processed = true;
+
+				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
+				switch (type)
+				{
+					case PROCSIGNAL_BARRIER_PLACEHOLDER:
+						processed = ProcessBarrierPlaceholder();
+						break;
+				}
+
+				/*
+				 * To avoid an infinite loop, we must always unset the bit
+				 * in flags.
+				 */
+				BARRIER_CLEAR_BIT(flags, type);
+
+				/*
+				 * If we failed to process the barrier, reset the shared bit
+				 * so we try again later, and set a flag so that we don't bump
+				 * our generation.
+				 */
+				if (!processed)
+				{
+					ResetProcSignalBarrierBits(((uint32) 1) << type);
+					success = false;
+				}
+			}
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, we'll need to try again later to handle
+			 * that barrier type and any others that haven't been handled yet
+			 * or weren't successfully absorbed.
+			 */
+			ResetProcSignalBarrierBits(flags);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier types were not successfully absorbed, we will have
+		 * to try again later.
+		 */
+		if (!success)
+			return;
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +591,20 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
+/*
+ * If it turns out that we couldn't absorb one or more barrier types, either
+ * because the barrier-processing functions returned false or due to an error,
+ * arrange for processing to be retried later.
+ */
 static void
+ResetProcSignalBarrierBits(uint32 flags)
+{
+	pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask, flags);
+	ProcSignalBarrierPending = true;
+	InterruptPending = true;
+}
+
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +614,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.18.0

v11-0006-WIP-Documentation.patch (application/octet-stream)
From efb62b722d624770d4b686dbd7ff8db465118c69 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v11 6/6] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because WAL writes were prohibited by executing ALTER SYSTEM READ ONLY.  We
+have a lower-level defense in XLogBeginInsert() and elsewhere to stop us from
+modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is called
+inside a critical section we must not depend on it to report an error, because
+that would escalate the error to a PANIC, as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt must
+stop writing WAL immediately.  As part of absorbing the barrier, a backend
+whose transaction has a valid XID is killed, since a valid XID indicates that
+the transaction has performed, or is planning, a WAL write.  Transactions that
+have not yet acquired an XID, and operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that do not necessarily have an XID when they write WAL, are not
+stopped by barrier processing, so they might hit the error from
+XLogBeginInsert() when they try to write WAL in the read-only state.  To keep
+that error out of a critical section, WAL write permission has to be checked
+before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that records whether
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the flag, call CheckWALPermitted(),
+AssertWALPermitted_HaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION().  The flag is reset automatically on exiting the critical
+section.  The rules for choosing among the permission check routines are as
+follows:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. vacuum) must be protected by CheckWALPermitted(), so that
+	the error can be reported before the critical section is entered.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked with AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where the permission check should still be
+	verified in assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while in read only (i.e. during
+recovery or in WAL prohibit state), so we simply skip dirtying blocks because of
+hints in those cases.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set in those states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v11-0003-Add-alter-system-read-only-write-syntax.patch (application/octet-stream)
From e79194f4adb77414ff191683bec2724d57261ea5 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v11 3/6] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 21 +++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 5a591d0a751..7ba114f6fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4010,6 +4010,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(walprohibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5393,6 +5402,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index e2895a8985d..269683fb873 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1761,6 +1761,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(walprohibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3448,6 +3454,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f26498cea2d..69e11e913a0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1356,6 +1356,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(walprohibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3909,6 +3918,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ab7b535caae..d5acb438f6d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2550,6 +2550,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(walprohibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2872,6 +2885,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index efc9c997541..ceed52fbaae 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -479,6 +479,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10202,8 +10203,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->walprohibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 81ac9b1cb2d..f68b11d0af1 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -85,6 +85,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -219,6 +220,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -834,6 +836,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2813,6 +2820,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3669,3 +3677,16 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 8afc780acc3..d416106e0ab 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1889,9 +1889,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7ddd8c011bf..7b233925692 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -411,6 +411,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d1f9ef29ca0..6f7b4f0600f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3192,6 +3192,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		walprohibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701bfd4d..37bfa69edf0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -87,6 +87,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.18.0

#73Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#62)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sat, Sep 12, 2020 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over? That seems very much not
practical. What am I missing?

Yes, the process doing ASRO will wait until that checkpoint is over.

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes. We want this feature to
respond within milliseconds or a few seconds, not minutes. So we need
something better here. I'm inclined to think that we should try to
CompleteWALProhibitChange() at the same places we
AbsorbSyncRequests(). We know from experience that bad things happen
if we fail to absorb sync requests in a timely fashion, so we probably
have enough calls to AbsorbSyncRequests() to make sure that we always
do that work in a timely fashion. So, if we do this work in the same
place, then it will also be done in a timely fashion.
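
Very roughly, the idea might look like this (a sketch only, not the
patch's code; WALProhibitStateChangePending() is a hypothetical name for
whatever test the patch provides, and CompleteWALProhibitChange()'s real
signature may differ):

    /*
     * Sketch: wherever the checkpointer already absorbs fsync requests,
     * also check for and complete a pending ALTER SYSTEM READ ONLY /
     * READ WRITE request, so that the state change is never delayed by
     * a long-running checkpoint.
     */
    static void
    AbsorbPendingOperations(void)
    {
        AbsorbSyncRequests();

        if (WALProhibitStateChangePending())
            CompleteWALProhibitChange();
    }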

I'm not 100% sure whether that introduces any other problems.
Certainly, we're not going to be able to finish the checkpoint once
we've gone read-only, so we'll fail when we try to write the WAL
record for that, or maybe earlier if there's anything else that tries
to write WAL. Either the checkpoint needs to error out, like any other
attempt to write WAL, and we can attempt a new checkpoint if and when
we go read/write, or else we need to finish writing stuff out to disk
but not actually write the checkpoint completion record (or any other
WAL) unless and until the system goes back into read/write mode - and
then at that point the previously-started checkpoint will finish
normally. The latter seems better if we can make it work, but the
former is probably also acceptable. What you've got right now is not.

--
Robert Haas
EDB: http://www.enterprisedb.com

#74Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#70)
Re: [Patch] ALTER SYSTEM READ ONLY

On 2020-11-20 11:23:44 -0500, Robert Haas wrote:

Andres, do you like the new loop better?

I do!

#75Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#73)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On 2020-12-09 16:13:06 -0500, Robert Haas wrote:

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes.

Or hours, even. Due to the cost of FPWs it can make a lot of sense to
reduce the frequency of that cost...

We want this feature to respond within milliseconds or a few seconds,
not minutes. So we need something better here.

Indeed.

I'm inclined to think
that we should try to CompleteWALProhibitChange() at the same places
we AbsorbSyncRequests(). We know from experience that bad things
happen if we fail to absorb sync requests in a timely fashion, so we
probably have enough calls to AbsorbSyncRequests() to make sure that
we always do that work in a timely fashion. So, if we do this work in
the same place, then it will also be done in a timely fashion.

Sounds sane, without having looked in detail.

I'm not 100% sure whether that introduces any other problems.
Certainly, we're not going to be able to finish the checkpoint once
we've gone read-only, so we'll fail when we try to write the WAL
record for that, or maybe earlier if there's anything else that tries
to write WAL. Either the checkpoint needs to error out, like any other
attempt to write WAL, and we can attempt a new checkpoint if and when
we go read/write, or else we need to finish writing stuff out to disk
but not actually write the checkpoint completion record (or any other
WAL) unless and until the system goes back into read/write mode - and
then at that point the previously-started checkpoint will finish
normally. The latter seems better if we can make it work, but the
former is probably also acceptable. What you've got right now is not.

I mostly wonder which of those two has which implications for how many
FPWs we need to redo. Presumably stalling but not cancelling the current
checkpoint is better?

Greetings,

Andres Freund

#76Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#75)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-12-09 16:13:06 -0500, Robert Haas wrote:

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes.

Or hours, even. Due to the cost of FPWs it can make a lot of sense to
reduce the frequency of that cost...

We want this feature to respond within milliseconds or a few seconds,
not minutes. So we need something better here.

Indeed.

I'm inclined to think
that we should try to CompleteWALProhibitChange() at the same places
we AbsorbSyncRequests(). We know from experience that bad things
happen if we fail to absorb sync requests in a timely fashion, so we
probably have enough calls to AbsorbSyncRequests() to make sure that
we always do that work in a timely fashion. So, if we do this work in
the same place, then it will also be done in a timely fashion.

Sounds sane, without having looked in detail.

Understood & agreed that we need to change the system state as soon as possible.

I can see that AbsorbSyncRequests() is called from four routines:
CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
CheckpointerMain(). Of the four, the first three execute with interrupts
held, which is a problem when we emit the barrier and then wait for it to
be absorbed by every process, including the checkpointer itself: that wait
would never finish. I think that can be fixed by teaching
WaitForProcSignalBarrier() not to wait for the emitting process to absorb
its own barrier; it can absorb it later, once interrupts are resumed. I
assumed that we cannot do the barrier processing right away, since there
could be other barriers (perhaps in the future), including ours, that
should not be processed while interrupts are on hold.
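
To illustrate the WaitForProcSignalBarrier() part, based on the loop shown
in the 0001 patch above (just a sketch of the idea, not final code):

    for (int i = 0; i < NumProcSignalSlots; i++)
    {
        volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
        uint64      oldval;

        /*
         * Skip our own slot: the emitting process may be running with
         * interrupts held and will absorb the barrier later, once
         * interrupts are resumed.
         */
        if (slot == MyProcSignalSlot)
            continue;

        oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
        while (oldval < generation)
        {
            /* ... existing wait-and-recheck logic ... */
        }
    }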

I'm not 100% sure whether that introduces any other problems.
Certainly, we're not going to be able to finish the checkpoint once
we've gone read-only, so we'll fail when we try to write the WAL
record for that, or maybe earlier if there's anything else that tries
to write WAL. Either the checkpoint needs to error out, like any other
attempt to write WAL, and we can attempt a new checkpoint if and when
we go read/write, or else we need to finish writing stuff out to disk
but not actually write the checkpoint completion record (or any other
WAL) unless and until the system goes back into read/write mode - and
then at that point the previously-started checkpoint will finish
normally. The latter seems better if we can make it work, but the
former is probably also acceptable. What you've got right now is not.

I mostly wonder which of those two has which implications for how many
FPWs we need to redo. Presumably stalling but not cancelling the current
checkpoint is better?

Also, I like the idea of stalling the checkpointer's work in the middle
rather than canceling it. But then we need to handle shutdown requests and
postmaster death, either of which can cancel the stall. If that happens,
we must make sure that no unwanted WAL insertion happens afterward, and
for that the LocalXLogInsertAllowed flag needs to be updated correctly,
because the WAL-prohibit barrier processing was skipped in the
checkpointer, since it is the process that emits that barrier, as
mentioned above.

Regards,
Amul

#77Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#76)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-12-09 16:13:06 -0500, Robert Haas wrote:

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes.

Or hours, even. Due to the cost of FPWs it can make a lot of sense to
reduce the frequency of that cost...

We want this feature to respond within milliseconds or a few seconds,
not minutes. So we need something better here.

Indeed.

I'm inclined to think
that we should try to CompleteWALProhibitChange() at the same places
we AbsorbSyncRequests(). We know from experience that bad things
happen if we fail to absorb sync requests in a timely fashion, so we
probably have enough calls to AbsorbSyncRequests() to make sure that
we always do that work in a timely fashion. So, if we do this work in
the same place, then it will also be done in a timely fashion.

Sounds sane, without having looked in detail.

Understood & agreed that we need to change the system state as soon as possible.

I can see that AbsorbSyncRequests() is called from four routines:
CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
CheckpointerMain(). Of the four, the first three execute with interrupts
held, which is a problem when we emit the barrier and then wait for it to
be absorbed by every process, including the checkpointer itself: that wait
would never finish. I think that can be fixed by teaching
WaitForProcSignalBarrier() not to wait for the emitting process to absorb
its own barrier; it can absorb it later, once interrupts are resumed. I
assumed that we cannot do the barrier processing right away, since there
could be other barriers (perhaps in the future), including ours, that
should not be processed while interrupts are on hold.

CreateCheckPoint() acquires the CheckpointLock LWLock at the start and
releases it at the end, which holds off interrupts for the whole duration.
It is somewhat surprising that we hold this lock, and with it hold off
interrupts, for such a long time. We need CheckpointLock only to ensure
that a single checkpoint runs at a time. Can't we ensure that with
something lighter than the lock? Holding off interrupts for so long
doesn't seem like a good idea. Thoughts/Suggestions?
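
For example, something along these lines might be enough to keep
checkpoints mutually exclusive without holding an LWLock, and therefore
interrupts, for the whole duration (purely illustrative; the flag and its
placement are made up):

    /* A flag in checkpointer shared memory, initialized at startup. */
    static pg_atomic_flag *CheckpointRunning;

    void
    CreateCheckPoint(int flags)
    {
        /* If another checkpoint is already running, wait for it. */
        while (!pg_atomic_test_set_flag(CheckpointRunning))
            pg_usleep(10000L);      /* crude polling; a CV would be nicer */

        /*
         * ... existing checkpoint work, with interrupts serviced
         * normally.  A real version would also need to clear the flag
         * on error, e.g. with PG_ENSURE_ERROR_CLEANUP().
         */

        pg_atomic_clear_flag(CheckpointRunning);
    }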

Regards,
Amul

#78Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#77)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Dec 14, 2020 at 8:03 PM Amul Sul <sulamul@gmail.com> wrote:

On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-12-09 16:13:06 -0500, Robert Haas wrote:

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes.

Or hours, even. Due to the cost of FPWs it can make a lot of sense to
reduce the frequency of that cost...

We want this feature to respond within milliseconds or a few seconds,
not minutes. So we need something better here.

Indeed.

I'm inclined to think
that we should try to CompleteWALProhibitChange() at the same places
we AbsorbSyncRequests(). We know from experience that bad things
happen if we fail to absorb sync requests in a timely fashion, so we
probably have enough calls to AbsorbSyncRequests() to make sure that
we always do that work in a timely fashion. So, if we do this work in
the same place, then it will also be done in a timely fashion.

Sounds sane, without having looked in detail.

Understood & agreed that we need to change the system state as soon as possible.

I can see that AbsorbSyncRequests() is called from four routines:
CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
CheckpointerMain(). Of those four, the first three execute with interrupts held,
which causes a problem when we emit a barrier and then wait for it to be absorbed
by every process, including ourselves -- that turns into an infinite wait. I think
that can be fixed by teaching WaitForProcSignalBarrier() not to wait on the
emitting process itself; its own absorption can happen later, once interrupts are
resumed. I assume we cannot do the barrier processing right away, since there
could be other barriers (maybe in the future), including ours, that should not be
processed while interrupts are held.

CreateCheckPoint() acquires the CheckpointLock LWLock at the start and releases
it at the end, which keeps interrupts held for the whole checkpoint. It is a bit
surprising that we hold this lock, and hence hold off interrupts, for such a long
time. We need CheckpointLock only to ensure that a single checkpoint runs at a
time. Can't we use something simpler to ensure that, instead of the lock? Holding
off interrupts for so long doesn't seem like a good idea. Thoughts/Suggestions?

To move development, testing, and review forward, I have commented out the code
that acquires CheckpointLock in CreateCheckPoint() in the 0003 patch, and added
changes to the checkpointer so that a system read-write state change request can
be processed as soon as possible, as suggested by Robert[1].

I have started a new thread[2] to understand the need for CheckpointLock in the
CreateCheckPoint() function. Until that is settled, we can continue working on
this feature by skipping CheckpointLock in CreateCheckPoint(), and therefore the
0003 patch is marked WIP.

1] /messages/by-id/CA+TgmoYexwDQjdd1=15KMz+7VfHVx8VHNL2qjRRK92P=CSZDxg@mail.gmail.com
2] /messages/by-id/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com

Regards,
Amul

Attachments:

v12-0001-Allow-for-error-or-refusal-while-absorbing-barri.patchapplication/x-patch; name=v12-0001-Allow-for-error-or-refusal-while-absorbing-barri.patchDownload
From b9304d1f557e02065898d61badae867901556a74 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 6 Oct 2020 15:26:43 -0400
Subject: [PATCH v12 1/6] Allow for error or refusal while absorbing barriers.

Previously, the per-barrier-type functions tasked with absorbing
them were expected to always succeed and never throw an error.
However, that's a bit inconvenient. Further study has revealed that
there are realistic cases where it might not be possible to absorb
a ProcSignalBarrier without terminating the transaction, or even
the whole backend. Similarly, for some barrier types, there might
be other reasons where it's not reasonably possible to absorb the
barrier at certain points in the code, so provide a way for a
per-barrier-type function to reject absorbing the barrier.

Patch by me, reviewed by Andres Freund.

Discussion: http://postgr.es/m/20200908182005.xya7wetdh3pndzim@alap3.anarazel.de
---
 src/backend/storage/ipc/procsignal.c | 125 ++++++++++++++++++++++++---
 1 file changed, 113 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 583efaecff8..c43cdd685b4 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -87,12 +88,17 @@ typedef struct
 #define BARRIER_SHOULD_CHECK(flags, type) \
 	(((flags) & (((uint32) 1) << (uint32) (type))) != 0)
 
+/* Clear the relevant type bit from the flags. */
+#define BARRIER_CLEAR_BIT(flags, type) \
+	((flags) &= ~(((uint32) 1) << (uint32) (type)))
+
 static ProcSignalHeader *ProcSignal = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
-static void ProcessBarrierPlaceholder(void);
+static void ResetProcSignalBarrierBits(uint32 flags);
+static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -394,6 +400,12 @@ WaitForProcSignalBarrier(uint64 generation)
 		volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
 		uint64		oldval;
 
+		/*
+		 * It's important that we check only pss_barrierGeneration here and
+		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
+		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * is updated only afterward.
+		 */
 		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		while (oldval < generation)
 		{
@@ -453,7 +465,7 @@ ProcessProcSignalBarrier(void)
 {
 	uint64		local_gen;
 	uint64		shared_gen;
-	uint32		flags;
+	volatile uint32		flags;
 
 	Assert(MyProcSignalSlot);
 
@@ -482,21 +494,92 @@ ProcessProcSignalBarrier(void)
 	 * read of the barrier generation above happens before we atomically
 	 * extract the flags, and that any subsequent state changes happen
 	 * afterward.
+	 *
+	 * NB: In order to avoid race conditions, we must zero pss_barrierCheckMask
+	 * first and only afterwards try to do barrier processing. If we did it
+	 * in the other order, someone could send us another barrier of some
+	 * type right after we called the barrier-processing function but before
+	 * we cleared the bit. We would have no way of knowing that the bit needs
+	 * to stay set in that case, so the need to call the barrier-processing
+	 * function again would just get forgotten. So instead, we tentatively
+	 * clear all the bits and then put back any for which we don't manage
+	 * to successfully absorb the barrier.
 	 */
 	flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
 
 	/*
-	 * Process each type of barrier. It's important that nothing we call from
-	 * here throws an error, because pss_barrierCheckMask has already been
-	 * cleared. If we jumped out of here before processing all barrier types,
-	 * then we'd forget about the need to do so later.
-	 *
-	 * NB: It ought to be OK to call the barrier-processing functions
-	 * unconditionally, but it's more efficient to call only the ones that
-	 * might need us to do something based on the flags.
+	 * If there are no flags set, then we can skip doing any real work.
+	 * Otherwise, establish a PG_TRY block, so that we don't lose track of
+	 * which types of barrier processing are needed if an ERROR occurs.
 	 */
-	if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-		ProcessBarrierPlaceholder();
+	if (flags != 0)
+	{
+		bool	success = true;
+
+		PG_TRY();
+		{
+			/*
+			 * Process each type of barrier. The barrier-processing functions
+			 * should normally return true, but may return false if the barrier
+			 * can't be absorbed at the current time. This should be rare,
+			 * because it's pretty expensive.  Every single
+			 * CHECK_FOR_INTERRUPTS() will return here until we manage to
+			 * absorb the barrier, and that cost will add up in a hurry.
+			 *
+			 * NB: It ought to be OK to call the barrier-processing functions
+			 * unconditionally, but it's more efficient to call only the ones
+			 * that might need us to do something based on the flags.
+			 */
+			while (flags != 0)
+			{
+				ProcSignalBarrierType	type;
+				bool processed = true;
+
+				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
+				switch (type)
+				{
+					case PROCSIGNAL_BARRIER_PLACEHOLDER:
+						processed = ProcessBarrierPlaceholder();
+						break;
+				}
+
+				/*
+				 * To avoid an infinite loop, we must always unset the bit
+				 * in flags.
+				 */
+				BARRIER_CLEAR_BIT(flags, type);
+
+				/*
+				 * If we failed to process the barrier, reset the shared bit
+				 * so we try again later, and set a flag so that we don't bump
+				 * our generation.
+				 */
+				if (!processed)
+				{
+					ResetProcSignalBarrierBits(((uint32) 1) << type);
+					success = false;
+				}
+			}
+		}
+		PG_CATCH();
+		{
+			/*
+			 * If an ERROR occurred, we'll need to try again later to handle
+			 * that barrier type and any others that haven't been handled yet
+			 * or weren't successfully absorbed.
+			 */
+			ResetProcSignalBarrierBits(flags);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		/*
+		 * If some barrier types were not successfully absorbed, we will have
+		 * to try again later.
+		 */
+		if (!success)
+			return;
+	}
 
 	/*
 	 * State changes related to all types of barriers that might have been
@@ -508,7 +591,20 @@ ProcessProcSignalBarrier(void)
 	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
 }
 
+/*
+ * If it turns out that we couldn't absorb one or more barrier types, either
+ * because the barrier-processing functions returned false or due to an error,
+ * arrange for processing to be retried later.
+ */
 static void
+ResetProcSignalBarrierBits(uint32 flags)
+{
+	pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask, flags);
+	ProcSignalBarrierPending = true;
+	InterruptPending = true;
+}
+
+static bool
 ProcessBarrierPlaceholder(void)
 {
 	/*
@@ -518,7 +614,12 @@ ProcessBarrierPlaceholder(void)
 	 * appropriately descriptive. Get rid of this function and instead have
 	 * ProcessBarrierSomethingElse. Most likely, that function should live in
 	 * the file pertaining to that subsystem, rather than here.
+	 *
+	 * The return value should be 'true' if the barrier was successfully
+	 * absorbed and 'false' if not. Note that returning 'false' can lead to
+	 * very frequent retries, so try hard to make that an uncommon case.
 	 */
+	return true;
 }
 
 /*
-- 
2.18.0

v12-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patchapplication/x-patch; name=v12-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patchDownload
From 2d1c0150e8f41ce15bcea3e052714ec0e7730e70 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2020 13:04:16 -0400
Subject: [PATCH v12 2/6] Test module for barriers. NOT FOR COMMIT.

---
 contrib/barrier/Makefile         | 23 ++++++++++++
 contrib/barrier/barrier--1.0.sql | 14 +++++++
 contrib/barrier/barrier.c        | 63 ++++++++++++++++++++++++++++++++
 contrib/barrier/barrier.control  |  5 +++
 4 files changed, 105 insertions(+)
 create mode 100644 contrib/barrier/Makefile
 create mode 100644 contrib/barrier/barrier--1.0.sql
 create mode 100644 contrib/barrier/barrier.c
 create mode 100644 contrib/barrier/barrier.control

diff --git a/contrib/barrier/Makefile b/contrib/barrier/Makefile
new file mode 100644
index 00000000000..71f59f6629e
--- /dev/null
+++ b/contrib/barrier/Makefile
@@ -0,0 +1,23 @@
+# contrib/barrier/Makefile
+
+MODULE_big = barrier
+OBJS = \
+	$(WIN32RES) \
+	barrier.o
+
+EXTENSION = barrier
+DATA = barrier--1.0.sql
+PGFILEDESC = "barrier - barrier test code NOT FOR COMMIT"
+
+REGRESS = barrier
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/barrier
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/barrier/barrier--1.0.sql b/contrib/barrier/barrier--1.0.sql
new file mode 100644
index 00000000000..66cae976a96
--- /dev/null
+++ b/contrib/barrier/barrier--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/barrier/barrier--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION barrier" to load this file. \quit
+
+CREATE FUNCTION emit_barrier(barrier_type text, count integer default 1)
+RETURNS void
+AS 'MODULE_PATHNAME', 'emit_barrier'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION wait_barrier(barrier_type text)
+RETURNS void
+AS 'MODULE_PATHNAME', 'wait_barrier'
+LANGUAGE C STRICT;
diff --git a/contrib/barrier/barrier.c b/contrib/barrier/barrier.c
new file mode 100644
index 00000000000..a0b98439924
--- /dev/null
+++ b/contrib/barrier/barrier.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.c
+ *	  emit ProcSignalBarriers for testing purposes
+ *
+ * Copyright (c) 2016-2020, PostgreSQL Global Development Group
+ *
+ *	  contrib/barrier/barrier.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/procsignal.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(emit_barrier);
+PG_FUNCTION_INFO_V1(wait_barrier);
+
+static ProcSignalBarrierType
+get_barrier_type(text *barrier_type)
+{
+	char	   *btype = text_to_cstring(barrier_type);
+
+	if (strcmp(btype, "placeholder") == 0)
+		return PROCSIGNAL_BARRIER_PLACEHOLDER;
+
+	elog(ERROR, "unknown barrier type: \"%s\"", btype);
+}
+
+Datum
+emit_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	int32		count = PG_GETARG_INT32(1);
+	int32		i;
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+
+	for (i = 0; i < count; ++i)
+	{
+		CHECK_FOR_INTERRUPTS();
+		EmitProcSignalBarrier(t);
+	}
+
+	PG_RETURN_VOID();
+}
+
+Datum
+wait_barrier(PG_FUNCTION_ARGS)
+{
+	text	   *barrier_type = PG_GETARG_TEXT_PP(0);
+	ProcSignalBarrierType t = get_barrier_type(barrier_type);
+	uint64		generation;
+
+	generation = EmitProcSignalBarrier(t);
+	elog(NOTICE, "waiting for barrier");
+	WaitForProcSignalBarrier(generation);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/barrier/barrier.control b/contrib/barrier/barrier.control
new file mode 100644
index 00000000000..425ffc15432
--- /dev/null
+++ b/contrib/barrier/barrier.control
@@ -0,0 +1,5 @@
+# barrier extension
+comment = 'emit ProcSignalBarrier for test purposes'
+default_version = '1.0'
+module_pathname = '$libdir/barrier'
+relocatable = true
-- 
2.18.0

v12-0004-WIP-Implement-ALTER-SYSTEM-READ-ONLY-using-globa.patchapplication/x-patch; name=v12-0004-WIP-Implement-ALTER-SYSTEM-READ-ONLY-using-globa.patchDownload
From 1615be569c0f5a3887e0acc38d3f4f17a4e1dd7b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v12 4/6] WIP - Implement ALTER SYSTEM READ ONLY using global
 barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited using the
    ALTER SYSTEM READ ONLY command or by calling the
    pg_alter_wal_prohibit_state(true) SQL function, the current state
    generation in shared memory is marked as in progress and the checkpointer
    process is signaled.  The checkpointer, noticing that the current state
    generation has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the
    barrier request, and then acknowledges back to the backend that requested
    the state change once the transition has been completed.  The final state
    is updated in the control file to make it persistent across system
    restarts.

 2. When a backend receives the WAL-prohibit barrier, if it is already in a
    transaction and that transaction has already been assigned an XID, then
    the backend will be killed by throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher, as well as the checkpointer, will not do
    anything while in the WAL-prohibited server state until someone wakes
    them up.  E.g. a user might later request putting the system back to
    read-write by executing ALTER SYSTEM READ WRITE.

 6. At shutdown in WAL-prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    needs some discussion on this as well), but the end-of-recovery checkpoint
    will be skipped; it will be performed when the system is changed back to
    WAL-permitted mode.

 7. ALTER SYSTEM READ ONLY/WRITE is restricted on a standby server.

 8. To execute ALTER SYSTEM READ ONLY/WRITE, the user must have execute
    permission on the pg_alter_wal_prohibit_state() function.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL-prohibited or in recovery.

=====
TODO:
=====
 1. Commented out the CheckpointLock-acquiring code in CreateCheckPoint() so
    that the checkpointer can process a WAL-prohibit state change request
    ASAP.  We might want to remove CheckpointLock from CreateCheckPoint()
    completely; that discussion has been started on pgsql-hackers[1].  Until
    then this patch is marked WIP.

====
REF:
====
 1. http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com

wip - add new function call from middle of checkpointer
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 464 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +-
 src/backend/access/transam/xlog.c        | 113 +++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   4 +
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  18 +
 src/backend/postmaster/pgstat.c          |   3 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |  15 +-
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  93 +++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   1 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 788 insertions(+), 74 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..de1f2f47040
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,464 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Indicates the current WAL prohibit state generation; the last two bits
+	 * of this generation indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 shared_state_generation;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void RequestWALProhibitChange(uint32 cur_state_gen);
+static void CompleteWALProhibitChange(uint32 cur_state_gen);
+static uint32 GetWALProhibitStateGen(void);
+static uint32 SetWALProhibitState(bool wal_prohibited, bool is_final_state);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit state.
+		 */
+		Assert(WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * AlterSystemSetWALProhibitState()
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	/* Check permission for pg_alter_wal_prohibit_state() */
+	if (pg_proc_aclcheck(F_PG_ALTER_WAL_PROHIBIT_STATE,
+						 GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("permission denied for command ALTER SYSTEM"),
+				 errhint("Grant execute permission on pg_alter_wal_prohibit_state() to this user.")));
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("ALTER SYSTEM");
+
+	/* Execute function to alter wal prohibit state */
+	(void) OidFunctionCall1(F_PG_ALTER_WAL_PROHIBIT_STATE,
+							BoolGetDatum(stmt->walprohibited));
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL callable function to alter system read write state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+	uint32		cur_state_gen;
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * This is not a final state, since we have yet to convey this WAL prohibit
+	 * state to all backends.
+	 */
+	cur_state_gen = SetWALProhibitState(walprohibited, false);
+
+	/* Server is already in requested state */
+	if (!cur_state_gen)
+		PG_RETURN_VOID();
+
+	/*
+	 * Signal the checkpointer to do the actual state transition, and wait for
+	 * the state change to occur.
+	 */
+	RequestWALProhibitChange(cur_state_gen);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	uint32 		cur_state = WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * RequestWALProhibitChange()
+ *
+ * Request checkpointer to make the WALProhibitState to read-only.
+ */
+static void
+RequestWALProhibitChange(uint32 cur_state_gen)
+{
+	/* Must not be called from checkpointer */
+	Assert(!AmCheckpointerProcess());
+	Assert(GetWALProhibitStateGen() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(cur_state_gen);
+		return;
+	}
+
+	/* Signal checkpointer process */
+	SendsSignalToCheckpointer(SIGINT);
+
+	/* Wait for the state to change to read-only */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once wal prohibit state generation changes */
+		if (GetWALProhibitStateGen() != cur_state_gen)
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(uint32 cur_state_gen)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * Since we do not process barrier for ourself, for now, return to "check"
+	 * state.
+	 */
+	ResetLocalXLogInsertAllowed();
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+	wal_prohibited =
+		(WALPROHIBIT_NEXT_STATE(cur_state_gen) == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Set the final state */
+	(void) SetWALProhibitState(wal_prohibited, true);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (LastCheckPointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetLastCheckPointSkipped(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	uint32		wal_state_gen;
+
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		wal_state_gen = GetWALProhibitStateGen();
+
+		if (wal_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			CompleteWALProhibitChange(wal_state_gen);
+			continue; /* check changed state */
+		}
+		else if (WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+				 WALPROHIBIT_STATE_READ_ONLY)
+		{
+			int			rc;
+
+			/*
+			 * Don't let the checkpointer process do anything until someone
+			 * wakes it up.  For example, a backend might later request us to
+			 * put the system back into the read-write (WAL permitted) state.
+			 */
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_CHECKPOINTER_MAIN);
+
+			/*
+			 * If the postmaster dies or a shutdown request is received, just
+			 * bail out.
+			 */
+			if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+				return;
+
+			/* Re-check wal prohibit state */
+			continue;
+		}
+
+		Assert(WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
+		break; /* Done */
+	}
+}
+
+/*
+ * GetWALProhibitStateGen()
+ *
+ * Atomically return the current server WAL prohibited state generation.
+ */
+static uint32
+GetWALProhibitStateGen(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->shared_state_generation);
+}
+
+/*
+ * SetWALProhibitState()
+ *
+ * Advances the shared WAL prohibit state generation to reflect the requested
+ * state and returns the new generation.
+ *
+ * For a transition-state request (is_final_state is false): if the desired
+ * transition state is the same as the current state -- i.e. some other
+ * backend has already requested it and the transition is in progress -- the
+ * current generation is returned so that this backend can wait until the
+ * shared generation advances to the final state.  If the server has already
+ * completely moved to the requested state, the requesting backend doesn't
+ * need to wait, and 0 is returned.
+ *
+ * The final state can only be requested by the checkpointer or by a
+ * single-user backend, so there is no chance that the server is already in
+ * the desired final state.
+ */
+static uint32
+SetWALProhibitState(bool wal_prohibited, bool is_final_state)
+{
+	uint32		new_state;
+	uint32		cur_state;
+	uint32		cur_state_gen;
+	uint32		next_state_gen;
+
+	/* Get the current state */
+	cur_state_gen = GetWALProhibitStateGen();
+	cur_state = WALPROHIBIT_CURRENT_STATE(cur_state_gen);
+
+	/* Compute new state */
+	if (is_final_state)
+	{
+		/*
+		 * Only checkpointer or single-user can set the final wal prohibit
+		 * state.
+		 */
+		Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+		/*
+		 * No other process can be setting the final state concurrently, so
+		 * the next state must be the desired final state.
+		 */
+		Assert(WALPROHIBIT_NEXT_STATE(cur_state) == new_state);
+	}
+	else
+	{
+		new_state = wal_prohibited ?
+			WALPROHIBIT_STATE_GOING_READ_ONLY :
+			WALPROHIBIT_STATE_GOING_READ_WRITE;
+
+		/* Server is already in the requested transition state */
+		if (cur_state == new_state)
+			return cur_state;		/* Wait for state transition completion */
+
+		/* Server is already in requested state */
+		if (WALPROHIBIT_NEXT_STATE(new_state) == cur_state)
+			return 0;		/* No wait is needed */
+
+		/* Prevent concurrent contrary in progress transition state setting */
+		if (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+		}
+	}
+
+	/*
+	 * Update the new state generation in shared memory only if the state
+	 * generation hasn't changed since we checked it above.
+	 */
+	next_state_gen = cur_state_gen + 1;
+	(void) pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+										  &cur_state_gen, next_state_gen);
+
+	/* To be sure that any later reads of memory happen strictly after this. */
+	pg_memory_barrier();
+
+	return next_state_gen;
+}
+
+/*
+ * WALProhibitStateGenerationInit()
+ *
+ * Initialization of shared wal prohibit state generation.
+ */
+void
+WALProhibitStateGenerationInit(bool wal_prohibited)
+{
+	uint32	new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibitState->shared_state_generation, new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (found)
+		return;
+
+	/* First time through ... */
+	memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+	ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd45..0de63af6365 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c1980..e31327ed5c7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -734,6 +736,11 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastCheckPointSkipped indicates if the last checkpoint is skipped.
+	 */
+	bool		lastCheckPointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6219,6 +6226,32 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or unset the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetLastCheckPointSkipped(bool ChkptSkip)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->lastCheckPointSkipped = ChkptSkip;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Return value of lastCheckPointSkipped flag.
+ */
+bool
+LastCheckPointIsSkipped(void)
+{
+	bool	ChkptSkipped;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ChkptSkipped = XLogCtl->lastCheckPointSkipped;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return ChkptSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7732,6 +7765,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize WAL prohibit state in shared
+	 * memory, which will decide whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateGenerationInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7742,7 +7781,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetLastCheckPointSkipped(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7988,6 +8037,16 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8203,9 +8262,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8224,9 +8283,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8248,6 +8318,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8537,9 +8613,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The restartpoint, checkpoint, or xlog rotation will be performed if the
+	 * WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8552,6 +8632,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
@@ -8820,7 +8903,11 @@ CreateCheckPoint(int flags)
 	 * only one process that is allowed to issue checkpoints at any given
 	 * time.)
 	 */
-	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+	//======================================================================================
+	// TODO: Tbc, the discussion on this lock requirement is in progress at
+	// http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
+	//======================================================================================
+	//LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
 	/*
 	 * Prepare to accumulate statistics.
@@ -8891,7 +8978,7 @@ CreateCheckPoint(int flags)
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
-			LWLockRelease(CheckpointLock);
+			//LWLockRelease(CheckpointLock);			TODO: Tbc
 			END_CRIT_SECTION();
 			ereport(DEBUG1,
 					(errmsg("checkpoint skipped because system is idle")));
@@ -9192,7 +9279,7 @@ CreateCheckPoint(int flags)
 									 CheckpointStats.ckpt_segs_removed,
 									 CheckpointStats.ckpt_segs_recycled);
 
-	LWLockRelease(CheckpointLock);
+	//LWLockRelease(CheckpointLock);	TODO: Tbc
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5d89e77dbe2..f8ae90c03c7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1518,6 +1518,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 47e60ca5613..516d6cd032a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -663,6 +663,10 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		HandleAutoVacLauncherInterrupts();
 
+		/* If the server is read only just go back to sleep. */
+		if (!XLogInsertAllowed())
+			continue;
+
 		/*
 		 * a worker finished, or postmaster signaled failure to start a worker
 		 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf611..1324d494724 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,16 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendsSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+void
+SendsSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		elog(ERROR, "checkpointer is not running");
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		elog(ERROR, "could not signal checkpointer: %m");
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f24a33ef1d..76a440836a2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4227,6 +4227,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c43cdd685b4..31383a11d08 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -604,24 +604,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index fe143151cc5..1c7b40563b5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up WAL prohibit change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * For the same reason mentioned previously for the wal prohibit
+				 * state change request check.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index e9cc4a22324..d295a1dc15f 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
@@ -86,7 +87,6 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
-static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -3686,16 +3686,3 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
-
-/*
- * AlterSystemSetWALProhibitState
- *
- * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
- */
-static void
-AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
-{
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
-}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca9..6d9c23f2d9f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2048,6 +2050,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12218,4 +12232,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..800542de123
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,93 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateGenerationInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * The WAL Prohibit States.
+ *
+ * 	Odd numbers represent transition states, whereas even numbers represent
+ * 	final states.  These can be distinguished by checking the 0th bit, aka
+ * 	the transition bit.
+ */
+#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
+#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
+#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
+#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3
+
+/* The transition bit to distinguish states.  */
+#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)
+
+/* Extract last two bits */
+#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
+	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
+	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertion above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) is not killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..d348155b1ac 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetLastCheckPointSkipped(bool ChkptSkip);
+extern bool LastCheckPointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* True if WAL insertion is prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d27336adcd9..8a54e35b8ff 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11329,6 +11329,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'alter system read only state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c38b6897101..491cf78a905 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1033,6 +1033,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..b29884ba0d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void SendsSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4ba4ec1cbdc..8484993d1c1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2696,6 +2696,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

Attachment: v12-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From e6bce84a1be1cf4b18a2e395a19425b7d956bbf9 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v12 5/6] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, add an Assert or an ERROR wherever WAL is
written while the system may be WAL prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common static
   inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the assertion also verifies that the XID is valid.  For that, added
   AssertWALPermittedHaveXID().

To enforce the rule that one of these checks precedes every critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when it is
called inside a critical section.

If the WAL insert is not done inside a critical section, the check above is not
necessary; we can rely on XLogBeginInsert() itself to perform the check and
report an error.
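
For illustration only (this sketch is not part of the patch, and the function
names are placeholders), the calling pattern that the rule expects looks
roughly like this:

    #include "access/walprohibit.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * A code path that may be reached without an assigned XID (e.g. VACUUM)
     * calls CheckWALPermitted() before the critical section, so that a plain
     * ERROR can still be raised if the system has become read only.
     */
    void
    example_without_xid(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        if (needwal)
            CheckWALPermitted();    /* ereport(ERROR) if WAL is prohibited */

        START_CRIT_SECTION();
        /* ... modify the page, MarkBufferDirty(buf), emit WAL if needwal ... */
        END_CRIT_SECTION();
    }

    /*
     * A code path that is reachable only with an XID already assigned (e.g. a
     * heap insert) merely asserts, because ALTER SYSTEM READ ONLY has already
     * terminated every XID-bearing session.
     */
    void
    example_with_xid(Relation rel, Buffer buf)
    {
        AssertWALPermittedHaveXID();

        START_CRIT_SECTION();
        /* ... WAL-logged page changes ... */
        END_CRIT_SECTION();
    }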
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 26 ++++++++++++-----
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 459 insertions(+), 69 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff0..9e195e4bc6f 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 992936cfa8e..affbd519328 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -466,9 +471,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -501,7 +509,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -527,6 +535,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -568,7 +579,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1643,6 +1654,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1661,13 +1673,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1684,7 +1699,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e127639..e0c483171ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5b9cfb26cf7..d97dd657f76 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1916,6 +1917,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2190,6 +2193,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 										   &vmbuffer, NULL);
 		page = BufferGetPage(buffer);
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2708,6 +2713,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3460,6 +3467,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3633,6 +3642,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4566,6 +4577,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5357,6 +5370,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5515,6 +5530,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5623,6 +5640,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5743,6 +5762,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5773,6 +5793,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5783,7 +5807,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index e3a716a2a2f..e93c211da4f 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad7..7370dc1f64d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1481,6 +1486,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1498,7 +1506,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibit state, so
+	 * skip the permission check when we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3336039125..c840912d116 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c28..9a9e2fc2b3c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -185,6 +186,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -208,6 +210,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -220,7 +226,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -338,6 +344,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -383,6 +390,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -401,7 +412,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1148,6 +1159,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1279,6 +1293,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2065,6 +2081,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2153,6 +2170,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2204,7 +2225,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2297,6 +2318,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2511,6 +2533,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2590,7 +2616,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222e..d689473a713 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1233448481a..5a665caa87b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b778324..91f2b18e367 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2264c2c849c..15cb9b9a25c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index de1f2f47040..60e368f6bcb 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0de63af6365..cdb18c47b0c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e31327ed5c7..2ba93bfad5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1036,7 +1036,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2873,9 +2873,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8886,6 +8888,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9150,6 +9154,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9310,6 +9317,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9974,7 +9983,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -9988,10 +9997,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10013,8 +10022,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1324d494724..dcec7d57471 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 20e50247ea4..693601de238 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v12-0003-Add-alter-system-read-only-write-syntax.patch (application/x-patch)
From 4750542d598361b021801c4d56ecdc7603912ef7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Mar 2020 05:05:38 -0400
Subject: [PATCH v12 3/6] Add alter system read only/write syntax

Note that syntax doesn't have any implementation.
---
 src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
 src/backend/nodes/equalfuncs.c   |  9 +++++++++
 src/backend/nodes/outfuncs.c     | 12 ++++++++++++
 src/backend/nodes/readfuncs.c    | 15 +++++++++++++++
 src/backend/parser/gram.y        | 13 +++++++++++++
 src/backend/tcop/utility.c       | 21 +++++++++++++++++++++
 src/bin/psql/tab-complete.c      |  6 ++++--
 src/include/nodes/nodes.h        |  1 +
 src/include/nodes/parsenodes.h   | 10 ++++++++++
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ba3ccc712c8..7ce1a3146bf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4028,6 +4028,15 @@ _copyAlterSystemStmt(const AlterSystemStmt *from)
 	return newnode;
 }
 
+static AlterSystemWALProhibitState *
+_copyAlterSystemWALProhibitState(const AlterSystemWALProhibitState *from)
+{
+	AlterSystemWALProhibitState *newnode = makeNode(AlterSystemWALProhibitState);
+
+	COPY_SCALAR_FIELD(walprohibited);
+	return newnode;
+}
+
 static CreateSeqStmt *
 _copyCreateSeqStmt(const CreateSeqStmt *from)
 {
@@ -5414,6 +5423,9 @@ copyObjectImpl(const void *from)
 		case T_AlterSystemStmt:
 			retval = _copyAlterSystemStmt(from);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _copyAlterSystemWALProhibitState(from);
+			break;
 		case T_CreateSeqStmt:
 			retval = _copyCreateSeqStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index a2ef853dc2a..fd54fe3088a 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1777,6 +1777,12 @@ _equalAlterSystemStmt(const AlterSystemStmt *a, const AlterSystemStmt *b)
 	return true;
 }
 
+static bool
+_equalAlterSystemWALProhibitState(const AlterSystemWALProhibitState *a, const AlterSystemWALProhibitState *b)
+{
+	COMPARE_SCALAR_FIELD(walprohibited);
+	return true;
+}
 
 static bool
 _equalCreateSeqStmt(const CreateSeqStmt *a, const CreateSeqStmt *b)
@@ -3467,6 +3473,9 @@ equal(const void *a, const void *b)
 		case T_AlterSystemStmt:
 			retval = _equalAlterSystemStmt(a, b);
 			break;
+		case T_AlterSystemWALProhibitState:
+			retval = _equalAlterSystemWALProhibitState(a, b);
+			break;
 		case T_CreateSeqStmt:
 			retval = _equalCreateSeqStmt(a, b);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8392be6d44a..47230ad346a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1359,6 +1359,15 @@ _outAlternativeSubPlan(StringInfo str, const AlternativeSubPlan *node)
 	WRITE_NODE_FIELD(subplans);
 }
 
+static void
+_outAlterSystemWALProhibitState(StringInfo str,
+								const AlterSystemWALProhibitState *node)
+{
+	WRITE_NODE_TYPE("ALTERSYSTEMWALPROHIBITSTATE");
+
+	WRITE_BOOL_FIELD(walprohibited);
+}
+
 static void
 _outFieldSelect(StringInfo str, const FieldSelect *node)
 {
@@ -3938,6 +3947,9 @@ outNode(StringInfo str, const void *obj)
 			case T_AlternativeSubPlan:
 				_outAlternativeSubPlan(str, obj);
 				break;
+			case T_AlterSystemWALProhibitState:
+				_outAlterSystemWALProhibitState(str, obj);
+				break;
 			case T_FieldSelect:
 				_outFieldSelect(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d2c8d58070b..f5f3c8cff89 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2553,6 +2553,19 @@ _readAlternativeSubPlan(void)
 	READ_DONE();
 }
 
+/*
+ * _readAlterSystemWALProhibitState
+ */
+static AlterSystemWALProhibitState *
+_readAlterSystemWALProhibitState(void)
+{
+	READ_LOCALS(AlterSystemWALProhibitState);
+
+	READ_BOOL_FIELD(walprohibited);
+
+	READ_DONE();
+}
+
 /*
  * _readExtensibleNode
  */
@@ -2875,6 +2888,8 @@ parseNodeString(void)
 		return_value = _readSubPlan();
 	else if (MATCH("ALTERNATIVESUBPLAN", 18))
 		return_value = _readAlternativeSubPlan();
+	else if (MATCH("ALTERSYSTEMWALPROHIBITSTATE", 27))
+		return_value = _readAlterSystemWALProhibitState();
 	else if (MATCH("EXTENSIBLENODE", 14))
 		return_value = _readExtensibleNode();
 	else if (MATCH("PARTITIONBOUNDSPEC", 18))
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 31c95443a5b..db39f7caa91 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -477,6 +477,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest
 				 SetResetClause FunctionSetResetClause
+%type <boolean> system_readonly_state
 
 %type <node>	TableElement TypedTableElement ConstraintElem TableFuncElement
 %type <node>	columnDef columnOptions
@@ -10227,8 +10228,20 @@ AlterSystemStmt:
 					n->setstmt = $4;
 					$$ = (Node *)n;
 				}
+			| ALTER SYSTEM_P system_readonly_state
+				{
+					AlterSystemWALProhibitState *n = makeNode(AlterSystemWALProhibitState);
+					n->walprohibited = $3;
+					$$ = (Node *)n;
+				}
 		;
 
+system_readonly_state:
+			 READ ONLY
+					{ $$ = true; }
+			| READ WRITE
+					{ $$ = false; }
+		;
 
 /*****************************************************************************
  *
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 53a511f1da8..e9cc4a22324 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -86,6 +86,7 @@ static void ProcessUtilitySlow(ParseState *pstate,
 							   DestReceiver *dest,
 							   QueryCompletion *qc);
 static void ExecDropStmt(DropStmt *stmt, bool isTopLevel);
+static void AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt);
 
 /*
  * CommandIsReadOnly: is an executable query read-only?
@@ -220,6 +221,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_IS_NOT_READ_ONLY;
 			}
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			{
 				/*
@@ -835,6 +837,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			AlterSystemSetConfigFile((AlterSystemStmt *) parsetree);
 			break;
 
+		case T_AlterSystemWALProhibitState:
+			PreventInTransactionBlock(isTopLevel, "ALTER SYSTEM");
+			AlterSystemSetWALProhibitState((AlterSystemWALProhibitState *) parsetree);
+			break;
+
 		case T_VariableSetStmt:
 			ExecSetVariableStmt((VariableSetStmt *) parsetree, isTopLevel);
 			break;
@@ -2818,6 +2825,7 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_REFRESH_MATERIALIZED_VIEW;
 			break;
 
+		case T_AlterSystemWALProhibitState:
 		case T_AlterSystemStmt:
 			tag = CMDTAG_ALTER_SYSTEM;
 			break;
@@ -3678,3 +3686,16 @@ GetCommandLogLevel(Node *parsetree)
 
 	return lev;
 }
+
+/*
+ * AlterSystemSetWALProhibitState
+ *
+ * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
+ */
+static void
+AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ALTER SYSTEM READ { ONLY | WRITE } not implemented")));
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6abcbea9634..cfa492c8d81 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1894,9 +1894,11 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER SERVER <name> VERSION <version> */
 	else if (Matches("ALTER", "SERVER", MatchAny, "VERSION", MatchAny))
 		COMPLETE_WITH("OPTIONS");
-	/* ALTER SYSTEM SET, RESET, RESET ALL */
+	/* ALTER SYSTEM READ, SET, RESET, RESET ALL */
 	else if (Matches("ALTER", "SYSTEM"))
-		COMPLETE_WITH("SET", "RESET");
+		COMPLETE_WITH("SET", "RESET", "READ");
+	else if (Matches("ALTER", "SYSTEM", "READ"))
+		COMPLETE_WITH("ONLY", "WRITE");
 	else if (Matches("ALTER", "SYSTEM", "SET|RESET"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_alter_system_set_vars);
 	else if (Matches("ALTER", "SYSTEM", "SET", MatchAny))
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index caed683ba92..5972800479f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -411,6 +411,7 @@ typedef enum NodeTag
 	T_RefreshMatViewStmt,
 	T_ReplicaIdentityStmt,
 	T_AlterSystemStmt,
+	T_AlterSystemWALProhibitState,
 	T_CreatePolicyStmt,
 	T_AlterPolicyStmt,
 	T_CreateTransformStmt,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc2bb40926a..6677f8ac470 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3211,6 +3211,16 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
+/* ----------------------
+ *		Alter System Read Statement
+ * ----------------------
+ */
+typedef struct AlterSystemWALProhibitState
+{
+	NodeTag		type;
+	bool		walprohibited;
+} AlterSystemWALProhibitState;
+
 /* ----------------------
  *		Cluster Statement (support pbrown's cluster index implementation)
  * ----------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fb57b8393f1..4ba4ec1cbdc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -86,6 +86,7 @@ AlterStatsStmt
 AlterSubscriptionStmt
 AlterSubscriptionType
 AlterSystemStmt
+AlterSystemWALProhibitState
 AlterTSConfigType
 AlterTSConfigurationStmt
 AlterTSDictionaryStmt
-- 
2.18.0

v12-0006-WIP-Documentation.patch (application/x-patch)
From 3ffc2c05902414b950c44d962d16bdab45716ff8 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v12 6/6] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read only state whenever it is not currently possible to
+insert write-ahead log records, either because the system is still in recovery
+or because it has been forced into the WAL prohibited state by executing ALTER
+SYSTEM READ ONLY.  We have a lower-level defense in XLogBeginInsert() and
+elsewhere to stop us from modifying data during recovery when
+!XLogInsertAllowed(), but if XLogBeginInsert() is reached inside a critical
+section we must not depend on it to report an error, because an error there
+escalates to a PANIC, as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but ALTER
+SYSTEM READ ONLY can be executed at any time to stop WAL writing.  Any backend
+that receives the read-only state transition barrier interrupt must stop
+writing WAL immediately.  While absorbing the barrier, a backend whose
+transaction holds a valid XID is killed, since a valid XID indicates that the
+transaction has already written WAL or is planning to do so.  Transactions that
+have not acquired an XID yet, and operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that do not necessarily have a valid XID when writing WAL, are not
+stopped by barrier processing; they might instead hit an error from
+XLogBeginInsert() when they try to write WAL in the read only state.  To keep
+such an error from being raised inside a critical section, WAL write permission
+must be checked before START_CRIT_SECTION().
+
+To enforce the rule that WAL permission is checked before entering a critical
+section that writes WAL, an assertion-only flag records whether permission was
+checked before calling XLogBeginInsert(); if it was not, XLogBeginInsert()
+fails an assertion.  The check is not mandatory when XLogBeginInsert() is used
+outside a critical section, where throwing an error is acceptable.  The flag is
+set by calling CheckWALPermitted(), AssertWALPermitted_HaveXID(), or
+AssertWALPermitted() before START_CRIT_SECTION(), and it is reset automatically
+when the critical section is exited.  The rules for choosing among the
+permission check routines are:
+
+	Places where a WAL write can be expected inside the critical section
+	without a valid XID (e.g. vacuum) must be protected by CheckWALPermitted(),
+	so that the error can be reported before the critical section is entered.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can use AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay the cost of the check.
+
+	Places that we know cannot be reached in the read-only state, which may or
+	may not have an XID, but where assert-enabled builds should still verify
+	that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints while in that state.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

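To make the coding rule described in the README patch above concrete, a
WAL-writing code path under this patch series is expected to look roughly like
the following sketch.  The routine, page changes, and WAL record details here
are hypothetical, not taken from the patch; only the ordering of
CheckWALPermitted() relative to START_CRIT_SECTION() matters.

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

void
example_modify_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Raise any "WAL is prohibited" error while plain ERROR is still safe */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... apply the changes to the shared buffer ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		recptr = XLogInsert(RM_GENERIC_ID, 0);	/* record details elided */
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();
}
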
#79Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#78)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:

To move development, testing, and the review forward, I have commented out the
code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
added the changes for the checkpointer so that system read-write state change
request can be processed as soon as possible, as suggested by Robert[1].

I have started a new thread[2] to understand the need for the CheckpointLock in
CreateCheckPoint() function. Until then we can continue work on this feature by
skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is
marked WIP.

Based on the favorable review comment from Andres upthread and also
your feedback, I committed 0001.

--
Robert Haas
EDB: http://www.enterprisedb.com

#80Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#78)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:

To move development, testing, and the review forward, I have commented out the
code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
added the changes for the checkpointer so that system read-write state change
request can be processed as soon as possible, as suggested by Robert[1].

I am extremely doubtful about SetWALProhibitState()'s claim that "The
final state can only be requested by the checkpointer or by the
single-user so that there will be no chance that the server is already
in the desired final state." It seems like there is an obvious race
condition: CompleteWALProhibitChange() is called with a cur_state_gen
argument which embeds the last state we saw, but there's nothing to
keep it from changing between the time we saw it and the time that
function calls SetWALProhibitState(), is there? We aren't holding any
lock. It seems to me that SetWALProhibitState() needs to be rewritten
to avoid this assumption.
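
To sketch the kind of interlock I have in mind (made-up names, not the patch's
actual code, and assuming the shared state is kept in a pg_atomic_uint32 that
encodes both the state and a generation counter):

#include "postgres.h"

#include "port/atomics.h"

static bool
TryInstallFinalWALProhibitState(pg_atomic_uint32 *shared_state_gen,
								uint32 observed_state_gen,
								uint32 final_state_gen)
{
	/*
	 * Install the final state only if nobody has changed the state since we
	 * looked at it; on failure the caller must re-read the state and decide
	 * afresh instead of assuming its earlier observation still holds.
	 */
	return pg_atomic_compare_exchange_u32(shared_state_gen,
										  &observed_state_gen,
										  final_state_gen);
}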

On a related note, SetWALProhibitState() has only two callers. One
passes is_final_state as true, and the other as false: it's never a
variable. The two cases are handled mostly differently. This doesn't
seem good. A lot of the logic in this function should probably be
moved to the calling sites, especially because it's almost certainly
wrong for this function to be basing what it does on the *current* WAL
prohibit state rather than the WAL prohibit state that was in effect
at the time we made the decision to call this function in the first
place. As I mentioned in the previous paragraph, that's a built-in
race condition. To put that another way, this function should NOT feel
free to call GetWALProhibitStateGen().

I don't really see why we should have both an SQL callable function
pg_alter_wal_prohibit_state() and also a DDL command for this. If
we're going to go with a functional interface, and I guess the idea of
that is to make it so GRANT EXECUTE works, then why not just get rid
of the DDL?

RequestWALProhibitChange() doesn't look very nice. It seems like it's
basically the second half of pg_alter_wal_prohibit_state(), not being
called from anywhere else. It doesn't seem to add anything to separate
it out like this; the interface between the two is not especially
clean.

It seems odd that ProcessWALProhibitStateChangeRequest() returns
without doing anything if !AmCheckpointerProcess(), rather than having
that be an Assert(). Why is it like that?

I think WALProhibitStateShmemInit() would probably look more similar
to other functions if it did if (found) { stuff; } rather than if
(!found) return; stuff; -- but I might be wrong about the existing
precedent.

The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff
seems confusingly-named, because we have other reasons for skipping a
checkpoint that are not what we're talking about here. I think this is
talking about whether we've performed a checkpoint after recovery, and
the naming should reflect that. But I think there's something else
wrong with the design, too: why is this protected by a spinlock? I
have questions in both directions. On the one hand, I wonder why we
need any kind of lock at all. On the other hand, if we do need a lock,
I wonder why a spinlock that protects only the setting and clearing of
the flag and nothing else is sufficient. There are zero comments
explaining what the idea behind this locking regime is, and I can't
understand why it should be correct.

In fact, I think this area needs a broader rethink. Like, the way you
integrated that stuff into StartupXLog(), it sure looks to me like we
might skip the checkpoint but still try to write other WAL records.
Before we reach the offending segment of code, we call
UpdateFullPageWrites(). Afterwards, we call XLogReportParameters().
Both of those are going to potentially write WAL. I guess you could
argue that's OK, on the grounds that neither function is necessarily
going to log anything, but I don't think I believe that. If I make my
server read only, take the OS down, change some GUCs, and then start
it again, I don't expect it to PANIC.

Also, I doubt that it's OK to skip the checkpoint as this code does
and then go ahead and execute recovery_end_command and update the
control file anyway. It sure looks like the existing code is written
with the assumption that the checkpoint happens before those other
things. One idea I just had was: suppose that, if the system is READ
ONLY, we don't actually exit recovery right away, and the startup
process doesn't exit. Instead we just sit there and wait for the
system to be made read-write again before doing anything else. But
then if hot_standby=false, there's no way for someone to execute a
ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which
seems bad. So perhaps we need to let in regular connections *as if*
the system were read-write while postponing not just the
end-of-recovery checkpoint but also the other associated things like
UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command,
control file update, etc. until the end of recovery. Or maybe that's
not the right idea either, but regardless of what we do here it needs
clear comments justifying it. The current version of the patch does
not have any.

I think that you've mis-positioned the check in autovacuum.c. Note
that the comment right afterwards says: "a worker finished, or
postmaster signaled failure to start a worker". Those are things we
should still check for even when the system is R/O. What we don't want
to do in that case is start new workers. I would suggest revising the
comment that starts with "There are some conditions that..." to
mention three conditions. The new one would be that the system is in a
read-only state. I'd mention that first, making the existing ones #2
and #3, and then add the code to "continue;" in that case right after
that comment, before setting current_time.
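
In other words, I'm imagining the launcher loop shaped roughly as below.  This
is a sketch only, with a hypothetical SystemIsReadOnly() standing in for
whatever check the patch ends up using; the real code lives in
AutoVacLauncherMain() in autovacuum.c.

#include "postgres.h"

#include "utils/timestamp.h"

extern bool SystemIsReadOnly(void);	/* hypothetical stand-in */

static void
LauncherLoopSketch(void)
{
	TimestampTz current_time;

	for (;;)
	{
		/* handle "a worker finished" / "failed to start a worker" here */

		/*
		 * Conditions in which we do not want to launch a new worker:
		 * 1. the system is in a read-only (WAL prohibited) state,
		 * 2. <existing condition>, or
		 * 3. <existing condition>.
		 * In the read-only case, just go back to sleep.
		 */
		if (SystemIsReadOnly())
			continue;

		current_time = GetCurrentTimestamp();
		(void) current_time;	/* rest of the launcher logic elided */
	}
}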

SendsSignalToCheckpointer() has multiple problems. As far as the name,
it should at least be "Send" rather than "Sends" but the corresponding
functions elsewhere have names like SendPostmasterSignal() not
SendSignalToPostmaster(). Also, why is it OK for it to use elog()
rather than ereport()? Also, why is it an error if the checkpointer's
not running, rather than just having the next checkpointer do it when
it's relaunched? Also, why pass SIGINT as an argument if there's only
one caller? A related thing that's also odd is that sending SIGINT
calls ReqCheckpointHandler() not anything specific to prohibiting WAL.
That is probably OK because that function now just sets the latch. But
then we could stop sending SIGINT to the checkpointer at all and just
send SIGUSR1, which would also set the latch, without using up a
signal. I wonder if we should make that change as a separate
preparatory patch. It seems like that would clear things up; it would
remove the oddity that this patch is invoking a handler called
ReqCheckpointerHandler() with no intention of requesting a checkpoint,
because ReqCheckpointerHandler() would be gone. That problem could
also be fixed by renaming ReqCheckpointerHandler() to something
clearer, but that seems inferior.

This is probably not a complete list of problems. Review from others
would be appreciated.

--
Robert Haas
EDB: http://www.enterprisedb.com

#81Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#80)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jan 20, 2021 at 2:15 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:

To move development, testing, and the review forward, I have commented out the
code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
added the changes for the checkpointer so that system read-write state change
request can be processed as soon as possible, as suggested by Robert[1].

I am extremely doubtful about SetWALProhibitState()'s claim that "The
final state can only be requested by the checkpointer or by the
single-user so that there will be no chance that the server is already
in the desired final state." It seems like there is an obvious race
condition: CompleteWALProhibitChange() is called with a cur_state_gen
argument which embeds the last state we saw, but there's nothing to
keep it from changing between the time we saw it and the time that
function calls SetWALProhibitState(), is there? We aren't holding any
lock. It seems to me that SetWALProhibitState() needs to be rewritten
to avoid this assumption.

It is not like that; let me explain. When a user backend requests an alteration
of the WAL prohibit state, either by using the ASRO/ASRW DDL from the previous
patch version or by calling pg_alter_wal_prohibit_state(), the WAL prohibit
state in shared memory is set to the corresponding transition state, i.e.
going-read-only or going-read-write, if it is not there already. If another
backend requests the same alteration, nothing changes in shared memory, but
that backend must wait until the transition to the final WAL prohibit state
completes. If a backend requests the opposite state while a transition is in
progress, it gets an error such as "system state transition to read only/write
is already in progress". Only one transition state can be set at a time.

As for changing a transition state to the corresponding final state, i.e.
read-only or read-write, only the checkpointer or a standalone backend does
that, so there is no concurrency when moving from a transition state to the
final state.
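
Roughly, the states and allowed transitions are as below (a sketch; the names
here are illustrative rather than exactly what the patch uses).  A user backend
only ever moves a final state to the matching transition state; the move from a
transition state to its final state is done by the checkpointer (or a
standalone backend), which is why there is no concurrency on that second step.

typedef enum
{
	/* final states; changing them begins a transition */
	WALPROHIBIT_STATE_READ_WRITE,		/* WAL writes permitted */
	WALPROHIBIT_STATE_READ_ONLY,		/* WAL writes prohibited */

	/* transition states; only one can be in effect at a time */
	WALPROHIBIT_STATE_GOING_READ_ONLY,	/* requested by a user backend */
	WALPROHIBIT_STATE_GOING_READ_WRITE	/* requested by a user backend */
} WALProhibitStateSketch;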

On a related note, SetWALProhibitState() has only two callers. One
passes is_final_state as true, and the other as false: it's never a
variable. The two cases are handled mostly differently. This doesn't
seem good. A lot of the logic in this function should probably be
moved to the calling sites, especially because it's almost certainly
wrong for this function to be basing what it does on the *current* WAL
prohibit state rather than the WAL prohibit state that was in effect
at the time we made the decision to call this function in the first
place. As I mentioned in the previous paragraph, that's a built-in
race condition. To put that another way, this function should NOT feel
free to call GetWALProhibitStateGen().

Understood. I have removed SetWALProhibitState() and moved the respective code
to the caller in the attached version.

I don't really see why we should have both an SQL callable function
pg_alter_wal_prohibit_state() and also a DDL command for this. If
we're going to go with a functional interface, and I guess the idea of
that is to make it so GRANT EXECUTE works, then why not just get rid
of the DDL?

Ok, dropped the DDL command patch. If we want it back in the future, I can add
it again.

Now, I am a little bit concerned about the current function name. How about the
name pg_set_wal_prohibit_state(bool), or having two functions,
pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void)? Any
other suggestions?

RequestWALProhibitChange() doesn't look very nice. It seems like it's
basically the second half of pg_alter_wal_prohibit_state(), not being
called from anywhere else. It doesn't seem to add anything to separate
it out like this; the interface between the two is not especially
clean.

Ok, moved that code into pg_alter_wal_prohibit_state() in the attached version.

It seems odd that ProcessWALProhibitStateChangeRequest() returns
without doing anything if !AmCheckpointerProcess(), rather than having
that be an Assert(). Why is it like that?

Like AbsorbSyncRequests().
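
That is, I followed the early-return guard that AbsorbSyncRequests() uses, so
the shape of the function (sketch, body elided) is:

void
ProcessWALProhibitStateChangeRequest(void)
{
	if (!AmCheckpointerProcess())
		return;

	/* ... absorb the request and complete the state change ... */
}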

I think WALProhibitStateShmemInit() would probably look more similar
to other functions if it did if (found) { stuff; } rather than if
(!found) return; stuff; -- but I might be wrong about the existing
precedent.

Ok, did the same in the attached version.

The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff
seems confusingly-named, because we have other reasons for skipping a
checkpoint that are not what we're talking about here. I think this is
talking about whether we've performed a checkpoint after recovery, and
the naming should reflect that. But I think there's something else
wrong with the design, too: why is this protected by a spinlock? I
have questions in both directions. On the one hand, I wonder why we
need any kind of lock at all. On the other hand, if we do need a lock,
I wonder why a spinlock that protects only the setting and clearing of
the flag and nothing else is sufficient. There are zero comments
explaining what the idea behind this locking regime is, and I can't
understand why it should be correct.

Renamed those functions to SetRecoveryCheckpointSkippedFlag() and
RecoveryCheckpointIsSkipped() respectively, and removed the lock, which is not
needed. Updated the comment on the lastRecoveryCheckpointSkipped variable to
explain the lock requirement.

In fact, I think this area needs a broader rethink. Like, the way you
integrated that stuff into StartupXLog(), it sure looks to me like we
might skip the checkpoint but still try to write other WAL records.
Before we reach the offending segment of code, we call
UpdateFullPageWrites(). Afterwards, we call XLogReportParameters().
Both of those are going to potentially write WAL. I guess you could
argue that's OK, on the grounds that neither function is necessarily
going to log anything, but I don't think I believe that. If I make my
server read only, take the OS down, change some GUCs, and then start
it again, I don't expect it to PANIC.

If the concern is that there will be a panic when UpdateFullPageWrites() and/or
XLogReportParameters() tries to write WAL while the shared-memory WAL prohibit
state is set, that is not the case: for those functions, WAL writing is
explicitly enabled by calling LocalSetXLogInsertAllowed().

I was under the impression that there would be no problem with allowing
UpdateFullPageWrites() and XLogReportParameters() to write WAL. It can be
considered an exception, since it is fine that these WAL records are not
streamed to the standby during a graceful failover; I may be wrong though.

Also, I doubt that it's OK to skip the checkpoint as this code does
and then go ahead and execute recovery_end_command and update the
control file anyway. It sure looks like the existing code is written
with the assumption that the checkpoint happens before those other
things.

Hmm, here we could go wrong. I need to look at this part carefully.

One idea I just had was: suppose that, if the system is READ
ONLY, we don't actually exit recovery right away, and the startup
process doesn't exit. Instead we just sit there and wait for the
system to be made read-write again before doing anything else. But
then if hot_standby=false, there's no way for someone to execute a
ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which
seems bad. So perhaps we need to let in regular connections *as if*
the system were read-write while postponing not just the
end-of-recovery checkpoint but also the other associated things like
UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command,
control file update, etc. until the end of recovery. Or maybe that's
not the right idea either, but regardless of what we do here it needs
clear comments justifying it. The current version of the patch does
not have any.

Will get back to you on this. Let me think more on this and the previous
point.

I think that you've mis-positioned the check in autovacuum.c. Note
that the comment right afterwards says: "a worker finished, or
postmaster signaled failure to start a worker". Those are things we
should still check for even when the system is R/O. What we don't want
to do in that case is start new workers. I would suggest revising the
comment that starts with "There are some conditions that..." to
mention three conditions. The new one would be that the system is in a
read-only state. I'd mention that first, making the existing ones #2
and #3, and then add the code to "continue;" in that case right after
that comment, before setting current_time.

Done.

SendsSignalToCheckpointer() has multiple problems. As far as the name,
it should at least be "Send" rather than "Sends" but the corresponding

"Sends" is unacceptable, it is a typo.

functions elsewhere have names like SendPostmasterSignal() not
SendSignalToPostmaster(). Also, why is it OK for it to use elog()
rather than ereport()? Also, why is it an error if the checkpointer's
not running, rather than just having the next checkpointer do it when
it's relaunched?

Ok, now the function only returns true or false, and it is up to the caller
what to do with that. In our case, the caller only issues a warning; this could
be a NOTICE as well, if you prefer.
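
For example, the caller side is meant to look roughly like this (a sketch; the
function name and message wording here are illustrative, not necessarily what
the attached patch uses):

	if (!SendSignalToCheckpointer(SIGUSR1))
		ereport(WARNING,
				(errmsg("could not signal checkpointer process")));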

Also, why pass SIGINT as an argument if there's only
one caller?

I thought that somebody could also reuse it in the future to send some other
signal to the checkpointer process.

A related thing that's also odd is that sending SIGINT
calls ReqCheckpointHandler() not anything specific to prohibiting WAL.
That is probably OK because that function now just sets the latch. But
then we could stop sending SIGINT to the checkpointer at all and just
send SIGUSR1, which would also set the latch, without using up a
signal. I wonder if we should make that change as a separate
preparatory patch. It seems like that would clear things up; it would
remove the oddity that this patch is invoking a handler called
ReqCheckpointerHandler() with no intention of requesting a checkpoint,
because ReqCheckpointerHandler() would be gone. That problem could
also be fixed by renaming ReqCheckpointerHandler() to something
clearer, but that seems inferior.

I am not clear on this part. In the attached version I am sending SIGUSR1
instead of SIGINT, which works for me.

This is probably not a complete list of problems. Review from others
would be appreciated.

Thanks a lot.

The attached version does not address all your comments; I'll continue working
on them.

Regards,
Amul

Attachments:

v13-0001-WIP-Implement-wal-prohibit-state-using-global-ba.patch (application/x-patch)
From 234b46d195a2c3cd71a8467cec26453a2ee06823 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v13 1/3] WIP - Implement wal prohibit state using global
 barrier.

Implementation:

 1. When a user requests a change of the server state to WAL-Prohibited by
    calling the pg_alter_wal_prohibit_state(true) SQL function, the current
    state generation in shared memory is marked as in progress and the
    checkpointer process is signaled.  The checkpointer, noticing that the
    current state generation has the WALPROHIBIT_TRANSITION_IN_PROGRESS flag
    set, emits the barrier request and then acknowledges back to the backend
    that requested the state change once the transition has completed.  The
    final state is updated in the control file to make it persistent across
    system restarts.

 2. When a backend receives the WAL-Prohibited barrier while it is already in
    a transaction that has been assigned an XID, the backend will be killed by
    throwing FATAL (XXX: need more discussion on this).

 3. Otherwise, if the backend's transaction does not have a valid XID, we
    don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which sets the local read-only state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher and the checkpointer will not do anything while
    the server is in the WAL-prohibited state until someone wakes them up,
    e.g. a backend later requesting that the system be put back to read-write.

 6. At shutdown in WAL-prohibited mode, we skip the shutdown checkpoint and
    xlog rotation.  Starting up again will perform crash recovery (XXX: needs
    some discussion as well), but the end-of-recovery checkpoint is skipped;
    it is performed later, when the system is changed back to WAL-permitted
    mode.

 7. Altering the WAL-prohibit state is not allowed on a standby server.

 8. Add a system_is_read_only GUC to show the system state -- it reads as
    true when the system is WAL-prohibited or in recovery.
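
For reference, the state encoding used above, condensed from walprohibit.h in
this patch: the shared generation counter only ever increments, and its low
two bits identify the state, so the cycle is READ_WRITE (0) ->
GOING_READ_ONLY (1) -> READ_ONLY (2) -> GOING_READ_WRITE (3) -> READ_WRITE.

    #define WALPROHIBIT_STATE_READ_WRITE        (uint32) 0  /* WAL permitted */
    #define WALPROHIBIT_STATE_GOING_READ_ONLY   (uint32) 1
    #define WALPROHIBIT_STATE_READ_ONLY         (uint32) 2  /* WAL prohibited */
    #define WALPROHIBIT_STATE_GOING_READ_WRITE  (uint32) 3

    /* Odd generation numbers are transition states. */
    #define WALPROHIBIT_TRANSITION_IN_PROGRESS  ((uint32) 1 << 0)

    /* Extract the last two bits of the generation counter. */
    #define WALPROHIBIT_CURRENT_STATE(stateGeneration) \
        ((uint32) (stateGeneration) & ((uint32) ((1 << 2) - 1)))
    #define WALPROHIBIT_NEXT_STATE(stateGeneration) \
        WALPROHIBIT_CURRENT_STATE((stateGeneration) + 1)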

=====
TODO:
=====
 1. The CheckpointLock acquisition in CreateCheckPoint() is commented out so
    that the checkpointer can process a WAL prohibit state change request
    ASAP.  We might want to get rid of CheckpointLock in CreateCheckPoint()
    entirely; that discussion has been started on pgsql-hackers[1].  Until it
    is settled, this patch is marked as WIP.

 2. To be confirmed: is skipping the recovery checkpoint in StartupXLOG()
    correct, and will the code right afterwards work with that?  Needs more
    thought on Robert's comments about the StartupXLOG() changes[2].

====
REF:
====
 1. http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
 2. http://postgr.es/m/CA+TgmoZf2by_6kbY0JntGEmrSkj3DxUTNZDXORNasGdsSmkJjA@mail.gmail.com
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 399 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 109 ++++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 ++
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  92 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 725 insertions(+), 63 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..831b8c7d8ea
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,399 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitStateData
+{
+	/*
+	 * Current WAL prohibit state generation; the last two bits of this
+	 * generation indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 shared_state_generation;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable walprohibit_cv;
+} WALProhibitStateData;
+
+static WALProhibitStateData *WALProhibitState = NULL;
+
+static void CompleteWALProhibitChange(uint32 cur_state_gen);
+static uint32 GetWALProhibitStateGen(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit state.
+		 */
+		Assert(WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing just the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we could
+		 * not simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only. In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_alter_wal_prohibit_state()
+ *
+ * SQL callable function to alter system read write state.
+ */
+Datum
+pg_alter_wal_prohibit_state(PG_FUNCTION_ARGS)
+{
+	bool		walprohibited = PG_GETARG_BOOL(0);
+	uint32		cur_state_gen;
+	uint32		cur_state;
+	uint32		new_transition_state;
+
+	/* Alter WAL prohibit state not allowed during recovery */
+	PreventCommandDuringRecovery("pg_alter_wal_prohibit_state()");
+
+	/*
+	 * This is not the final state, since we have yet to convey this WAL
+	 * prohibit state to all backends.
+	 */
+	new_transition_state = walprohibited?
+		WALPROHIBIT_STATE_GOING_READ_ONLY :
+		WALPROHIBIT_STATE_GOING_READ_WRITE;
+
+	cur_state_gen = GetWALProhibitStateGen();
+	cur_state = WALPROHIBIT_CURRENT_STATE(cur_state_gen);
+
+	/* Server is already in requested state */
+	if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
+		PG_RETURN_VOID();
+
+	/*
+	 * Increment the state generation counter in shared memory, unless the
+	 * requested transition is already in progress; in that case just wait
+	 * for the in-progress transition to complete.
+	 */
+	if (new_transition_state != cur_state)
+	{
+		bool		success PG_USED_FOR_ASSERTS_ONLY;
+		uint32		next_state_gen;
+
+		/* Prevent a concurrent, contrary in-progress state transition */
+		if (cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			if (cur_state == WALPROHIBIT_STATE_READ_ONLY)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+		}
+
+		/* Increment the state generation counter in shared memory. */
+		next_state_gen = cur_state_gen + 1;
+		success = pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+												 &cur_state_gen, next_state_gen);
+		Assert(success);
+
+		/* Update our local state generation counter as well */
+		cur_state_gen = next_state_gen;
+
+		/* To be sure that any later reads of memory happen strictly after this. */
+		pg_memory_barrier();
+	}
+
+	/* Now must be in the requested transition state */
+	Assert(WALPROHIBIT_CURRENT_STATE(cur_state_gen) == new_transition_state);
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange(cur_state_gen);
+		PG_RETURN_VOID();
+	}
+
+	/* Signal checkpointer process */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();	/* no wait */
+	}
+
+	/* Wait for the state generation counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
+	for (;;)
+	{
+		/* We'll be done once wal prohibit state generation changes */
+		if (GetWALProhibitStateGen() != cur_state_gen)
+			break;
+
+		ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	}
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	uint32 		cur_state = WALPROHIBIT_CURRENT_STATE(GetWALProhibitStateGen());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(uint32 cur_state_gen)
+{
+	uint64		barrier_gen;
+	uint32		new_state;
+	bool		wal_prohibited;
+	bool		success PG_USED_FOR_ASSERTS_ONLY;
+
+	/*
+	 * Must be called from checkpointer. Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+	Assert(cur_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS);
+
+	/*
+	 * WAL prohibit state change is initiated. We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * Since we do not process the barrier for ourselves, return to "check"
+	 * state for now.
+	 */
+	ResetLocalXLogInsertAllowed();
+
+	/* And flush all inserts. */
+	XLogFlush(GetXLogInsertRecPtr());
+
+
+	/* Set the final state */
+	new_state = WALPROHIBIT_NEXT_STATE(cur_state_gen);
+
+	/*
+	 * No other process can set the final state concurrently, so the next
+	 * final state is guaranteed to be the desired one.
+	 */
+	Assert((GetWALProhibitStateGen() & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
+		   (WALPROHIBIT_NEXT_STATE(GetWALProhibitStateGen()) == new_state));
+
+	/* Update the new state generation in shared memory. */
+	success = pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+											 &cur_state_gen, cur_state_gen + 1);
+	Assert(success);
+
+	/* To be sure that any later reads of memory happen strictly after this. */
+	pg_memory_barrier();
+
+	wal_prohibited = (new_state == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (RecoveryCheckpointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetRecoveryCheckpointSkippedFlag(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	uint32		wal_state_gen;
+
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		wal_state_gen = GetWALProhibitStateGen();
+
+		if (wal_state_gen & WALPROHIBIT_TRANSITION_IN_PROGRESS)
+		{
+			CompleteWALProhibitChange(wal_state_gen);
+			continue; /* check changed state */
+		}
+		else if (WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+				 WALPROHIBIT_STATE_READ_ONLY)
+		{
+			int			rc;
+
+			/*
+			 * Don't let the checkpointer process do anything until someone
+			 * wakes it up.  For example, a backend might later request that
+			 * the system be put back into the read-write (WAL permitted) state.
+			 */
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_WALPROHIBIT_STATE);
+
+			/*
+			 * If the postmaster dies or a shutdown request is received, just
+			 * bail out.
+			 */
+			if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+				return;
+
+			/* Re-check wal prohibit state */
+			continue;
+		}
+
+		Assert(WALPROHIBIT_CURRENT_STATE(wal_state_gen) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
+		break; /* Done */
+	}
+}
+
+/*
+ * GetWALProhibitStateGen()
+ *
+ * Atomically return the current server WAL prohibited state generation.
+ */
+static uint32
+GetWALProhibitStateGen(void)
+{
+	return pg_atomic_read_u32(&WALProhibitState->shared_state_generation);
+}
+
+/*
+ * WALProhibitStateGenerationInit()
+ *
+ * Initialization of shared wal prohibit state generation.
+ */
+void
+WALProhibitStateGenerationInit(bool wal_prohibited)
+{
+	uint32	new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibitState->shared_state_generation, new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibitState = (WALProhibitStateData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitStateData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibitState, 0, sizeof(WALProhibitStateData));
+		ConditionVariableInit(&WALProhibitState->walprohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd45..0de63af6365 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b331..d1f0ce9a3ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -734,6 +736,13 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastRecoveryCheckpointSkipped indicates whether the last recovery
+	 * checkpoint was skipped.  Lock protection is not needed since it isn't
+	 * going to be read and/or updated concurrently.
+	 */
+	bool		lastRecoveryCheckpointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6220,6 +6229,25 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or unset the flag indicating that the last checkpoint was skipped.
+ */
+void
+SetRecoveryCheckpointSkippedFlag(bool ChkptSkip)
+{
+	XLogCtl->lastRecoveryCheckpointSkipped = ChkptSkip;
+}
+
+/*
+ * Return value of lastRecoveryCheckpointSkipped flag.
+ */
+bool
+RecoveryCheckpointIsSkipped(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->lastRecoveryCheckpointSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7782,6 +7810,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateGenerationInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7792,7 +7826,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetRecoveryCheckpointSkippedFlag(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7988,6 +8032,7 @@ StartupXLOG(void)
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8038,6 +8083,16 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8253,9 +8308,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8274,9 +8329,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8298,6 +8364,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8587,9 +8659,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint during recovery; otherwise, perform the shutdown
+	 * checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8602,6 +8678,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
@@ -8870,7 +8949,11 @@ CreateCheckPoint(int flags)
 	 * only one process that is allowed to issue checkpoints at any given
 	 * time.)
 	 */
-	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+	//======================================================================================
+	// TODO: Tbc, the discussion on this lock requirement is in progress at
+	// http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
+	//======================================================================================
+	//LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
 	/*
 	 * Prepare to accumulate statistics.
@@ -8941,7 +9024,7 @@ CreateCheckPoint(int flags)
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
-			LWLockRelease(CheckpointLock);
+			//LWLockRelease(CheckpointLock);			TODO: Tbc
 			END_CRIT_SECTION();
 			ereport(DEBUG1,
 					(errmsg("checkpoint skipped because system is idle")));
@@ -9242,7 +9325,7 @@ CreateCheckPoint(int flags)
 									 CheckpointStats.ckpt_segs_removed,
 									 CheckpointStats.ckpt_segs_recycled);
 
-	LWLockRelease(CheckpointLock);
+	//LWLockRelease(CheckpointLock);	TODO: Tbc
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d78..b7c83f4b9e4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1525,6 +1525,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_alter_wal_prohibit_state(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 47e60ca5613..2d14b448eb5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be allowed.  Second, we need to make sure that there is
+		 * a worker slot available.  Third, we need to make sure that no other
+		 * worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf611..033f8a7bdd9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the checkpoint process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719dd..63d52825497 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4307,6 +4307,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c43cdd685b4..31383a11d08 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -604,24 +604,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index fe143151cc5..1c7b40563b5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for the
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up a WAL prohibit state change request when
+		 * there are many fsync requests to be processed.  It needs to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * For the same reason mentioned previously for the wal prohibit
+				 * state change request check.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1d81071c357..a5f8ced59e4 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca9..6d9c23f2d9f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2048,6 +2050,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12218,4 +12232,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..01785d293db
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,92 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateGenerationInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * The WAL Prohibit States.
+ *
+ * 	An odd number represents a transition state, whereas an even number
+ * 	represents a final state.  These states can be distinguished by checking
+ * 	the 0th bit, aka the transition bit.
+ */
+#define	WALPROHIBIT_STATE_READ_WRITE		(uint32) 0	/* WAL permitted */
+#define	WALPROHIBIT_STATE_GOING_READ_ONLY	(uint32) 1
+#define	WALPROHIBIT_STATE_READ_ONLY			(uint32) 2	/* WAL prohibited */
+#define	WALPROHIBIT_STATE_GOING_READ_WRITE	(uint32) 3
+
+/* The transition bit to distinguish states.  */
+#define	WALPROHIBIT_TRANSITION_IN_PROGRESS	((uint32) 1 << 0)
+
+/* Extract last two bits */
+#define	WALPROHIBIT_CURRENT_STATE(stateGeneration)	\
+	((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define	WALPROHIBIT_NEXT_STATE(stateGeneration)	\
+	WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * The startup process during recovery is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state
+ * to WAL prohibited.  Therefore, we need to explicitly error out before
+ * entering the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..9857ab05c43 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetRecoveryCheckpointSkippedFlag(bool ChkptSkip);
+extern bool RecoveryCheckpointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* wal_prohibited determines whether WAL inserts are allowed or not. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a3..40bea0f4271 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11367,6 +11367,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'alter system read write state',
+  proname => 'pg_alter_wal_prohibit_state', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_alter_wal_prohibit_state' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87e..8f4fc4f1e15 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 943142ced8c..f295b05e760 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2697,6 +2697,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v13-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 6e0a961b7e39081ad48f123d29c18c4dc963507c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v13 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, add an Assert or an ERROR before critical
sections that write WAL, raised when the system is WAL-prohibited:

 - Add an ERROR in functions that can be reached without a valid XID, e.g. in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, the common
   static inline function CheckWALPermitted() is added.
 - Add an Assert in functions that cannot be reached without a valid XID; the
   Assert also verifies XID validity.  For that, AssertWALPermittedHaveXID()
   is added.

To enforce the rule that one of these checks precedes any critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when it is
called inside a critical section.

If the WAL insert is not done inside a critical section, the above checks are
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
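
A typical call site then ends up looking like this (condensed from the index
and heap changes below; "rel" and "buf" stand for whatever relation and buffer
the caller is modifying):

    needwal = RelationNeedsWAL(rel);

    if (needwal)
        CheckWALPermitted();    /* reachable without an XID; ERROR if read only */

    START_CRIT_SECTION();

    /* ... page modifications, MarkBufferDirty(), etc. ... */

    if (needwal)
    {
        /* XLOG stuff: XLogBeginInsert(), XLogInsert(), ... */
    }

    END_CRIT_SECTION();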
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 26 ++++++++++++-----
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 459 insertions(+), 69 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff0..9e195e4bc6f 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f203bb594cd..7ad6e0036c3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1646,6 +1657,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	OffsetNumber offnum,
 				maxoff;
 	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1664,13 +1676,16 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			deletable[ndeletable++] = offnum;
 	}
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+	if (XLogStandbyInfoActive() && needwal)
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 												 deletable, ndeletable);
 
 	if (ndeletable > 0)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1687,7 +1702,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e127639..e0c483171ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index faffbb18658..6aa14c275d2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1916,6 +1917,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2200,6 +2203,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2761,6 +2766,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3513,6 +3520,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3686,6 +3695,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4619,6 +4630,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5410,6 +5423,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5568,6 +5583,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5676,6 +5693,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5796,6 +5815,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5826,6 +5846,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5836,7 +5860,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index e3a716a2a2f..e93c211da4f 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad7..7370dc1f64d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1481,6 +1486,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1498,7 +1506,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state, so
+	 * skip the permission check when we get here from the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3336039125..c840912d116 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c28..9a9e2fc2b3c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -185,6 +186,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -208,6 +210,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -220,7 +226,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -338,6 +344,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -383,6 +390,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -401,7 +412,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1148,6 +1159,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1279,6 +1293,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2065,6 +2081,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2153,6 +2170,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2204,7 +2225,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2297,6 +2318,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2511,6 +2533,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2590,7 +2616,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222e..d689473a713 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7dcfa023236..bc0dcddb9b2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b778324..91f2b18e367 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2264c2c849c..15cb9b9a25c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 831b8c7d8ea..cec1bb2d0ba 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag used to enforce the rule that WAL insert permission must
+ * be checked before starting a critical section that writes WAL.  One of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before entering the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0de63af6365..cdb18c47b0c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d1f0ce9a3ba..9d38d1766ce 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1039,7 +1039,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2876,9 +2876,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not block WAL flushing; a dirty buffer cannot be
+	 * evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8932,6 +8934,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9196,6 +9200,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9356,6 +9363,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10020,7 +10029,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10034,10 +10043,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10059,8 +10068,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in the
+		 * WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 033f8a7bdd9..e4ee43e4a41 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* Checkpoints are allowed in recovery, but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool		needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 20e50247ea4..693601de238 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v13-0003-WIP-Documentation.patch (application/x-patch)
From e008b26960c269ddfa5ec192ded4523f09bed100 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v13 3/3] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by executing ALTER
+SYSTEM READ ONLY.  We have a lower-level defense in XLogBeginInsert() and
+elsewhere to stop us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because any error raised there escalates to a PANIC, as
+mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier must stop
+writing WAL immediately.  While absorbing the barrier, a backend whose current
+transaction has a valid XID -- which indicates that the transaction has already
+written, or intends to write, WAL -- is killed.  Transactions that have not yet
+acquired an XID, and operations such as VACUUM or CREATE INDEX CONCURRENTLY
+that can write WAL without acquiring an XID, are not stopped during barrier
+processing; they may instead hit an error from XLogBeginInsert() when they try
+to write WAL in the read-only state.  To prevent such an error from occurring
+inside a critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, an assert-only flag records whether the permission has
+been checked before calling XLogBeginInsert().  If it has not, XLogBeginInsert()
+fails an assertion.  The permission check is not mandatory when
+XLogBeginInsert() is called outside a critical section, where throwing an error
+is acceptable.  To set the flag, call CheckWALPermitted(),
+AssertWALPermitted_HaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among these permission check routines
+are:
+
+	Places where a WAL write inside a critical section can happen without a
+	valid XID (e.g. vacuum) must be protected by CheckWALPermitted(), so that
+	the error can be reported before the critical section is entered.
+
+	Places that perform INSERT or UPDATE, which never happen without a valid
+	XID, can use AssertWALPermitted_HaveXID(), so that non-assert builds do
+	not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want the permission check on
+	assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints while read only.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

#82Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#81)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul@gmail.com> wrote:

It is not like that, let me explain. When a user backend requests to alter WAL
prohibit state by using ASRO/ASRW DDL with the previous patch or calling
pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
set to the transition state i.e. going-read-only or going-read-write if it is
not already. If another backend trying to request the same alteration to the
wal prohibit state then nothing going to be changed in shared memory but that
backend needs to wait until the transition to the final wal prohibited state
completes. If a backend tries to request for the opposite state than the
previous which is in progress then it will see an error as "system state
transition to read only/write is already in progress". At a time only one
transition state can be set.

Hrm. Well, then that needs to be abundantly clear in the relevant comments.

Now, I am a little bit concerned about the current function name. How about
pg_set_wal_prohibit_state(bool) name or have two functions as
pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
other suggestions?

How about pg_prohibit_wal(true|false)?

It seems odd that ProcessWALProhibitStateChangeRequest() returns
without doing anything if !AmCheckpointerProcess(), rather than having
that be an Assert(). Why is it like that?

Like AbsorbSyncRequests().

Well, that can be called not from the checkpointer, according to the
comments. Specifically from the postmaster, I guess. Again, comments
please.

If you think that there will be panic when UpdateFullPageWrites() and/or
XLogReportParameters() tries to write WAL since the shared memory state for WAL
prohibited is set then it is not like that. For those functions, WAL write is
explicitly enabled by calling LocalSetXLogInsertAllowed().

I was under the impression that there won't be any problem if we allow the
writing WAL to UpdateFullPageWrites() and XLogReportParameters(). It can be
considered as an exception since it is fine that this WAL record is not streamed
to standby while graceful failover, I may be wrong though.

I don't think that's OK. I mean, the purpose of the feature is to
prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
the principle of least surprise.

I am not clear on this part. In the attached version I am sending SIGUSR1
instead of SIGINT, which works for me.

OK.

The attached version does not address all your comments, I'll continue my work
on that.

Some thoughts on this version:

+/* Extract last two bits */
+#define        WALPROHIBIT_CURRENT_STATE(stateGeneration)      \
+       ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define        WALPROHIBIT_NEXT_STATE(stateGeneration) \
+       WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

This is really confusing. First, the comment looks like it applies to
both based on how it is positioned, but that's clearly not true.
Second, the naming is really hard to understand. Third, there don't
seem to be comments explaining the theory of what is going on here.
Fourth, stateGeneration refers not to which generation of state we've
got here but to the combination of the state and the generation.
However, it's not clear that we ever really use the generation for
anything.

I think that the direction you went with this is somewhat different
from what I had in mind. That may be OK, but let me just explain the
difference. We both had in mind the idea that the low two bits of the
state would represent the current state and the upper bits would
represent the state generation. However, I wasn't necessarily
imagining that the only supported operation was making the combined
value go up by 1. For instance, I had thought that perhaps the effect
of trying to go read-only when we're in the middle of going read-write
would be to cancel the previous operation and start the new one. What
you have instead is that it errors out. So in your model a change
always has to finish before the next one can start, which in turn
means that the sequence is completely linear. In my idea the
state+generation might go from say 1 to 7, because trying to go
read-write would cancel the previous attempt to go read-only and
replace it with an attempt to go the other direction, and from 7 we
might go to 9 if somebody now tries to go read-only again before
that finishes. In your model, there's never any sort of cancellation
of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.

One disadvantage of the way you've got it from a user perspective is
that if I'm writing a tool, I might get an error telling me that the
state change I'm trying to make is already in progress, and then I
have to retry. With the other design, I might attempt a state change
and have it fail because the change can't be completed, but I won't
ever fail because I attempt a state change and it can't be started
because we're in the wrong starting state. So, with this design, as
the tool author, I may not be able to just say, well, I tried to
change the state and it didn't work, so report the error to the user.
I think with the other approach that would be more viable. But I might
be wrong here; it would be interesting to hear what other people
think.

I dislike the use of the term state_gen or StateGen to refer to the
combination of a state and a generation. That seems unintuitive. I'm
tempted to propose that we just call it a counter, and, assuming we
stick with the design as you now have it, explain it with a comment
like this in walprohibit.h:

"There are four possible states. A brand new database cluster is
always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to
make it read only, then we enter the state
WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we
enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently
tries to make it read write, we will enter the state
WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete,
we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state
transitions are the only ones possible; for example, if we're
currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go
read-write will produce an error, and a second attempt to go read-only
will not cause a state change. Thus, we can represent the state as a
shared-memory counter whose value only ever changes by adding 1.
The initial value at postmaster startup is either 0 or 2, depending on
whether the control file specifies that the system is starting
read-only or read-write."

And then maybe change all the state_gen references to reference
wal_prohibit_counter or, where a shorter name is appropriate, counter.

I think this might be clearer if we used different data types for the
state and the state/generation combination, with functions to convert
between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0
etc. maybe do:

typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;

And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:

static inline WALProhibitState
GetWALProhibitState(uint32 wal_prohibit_counter)
{
return (WALProhibitState) (wal_prohibit_counter & 3);
}
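
To make that concrete -- purely as an illustration, using the state names that
already appear in this thread, the 0/2 initial values, and the strictly
increasing counter described above -- the enum might end up as something like:

typedef enum WALProhibitState
{
    WALPROHIBIT_STATE_READ_WRITE = 0,       /* also the initial counter for a read-write cluster */
    WALPROHIBIT_STATE_GOING_READ_ONLY = 1,  /* transition to read only requested */
    WALPROHIBIT_STATE_READ_ONLY = 2,        /* also the initial counter for a read-only cluster */
    WALPROHIBIT_STATE_GOING_READ_WRITE = 3  /* transition to read write requested */
} WALProhibitState;

so that counter values 0, 1, 2, 3, 4, ... decode to read-write, going-read-only,
read-only, going-read-write, read-write again, and so on.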

I don't really know why we need WALPROHIBIT_NEXT_STATE at all,
honestly. It's just a macro to add 1 to an integer. And you don't even
use it consistently. Like pg_alter_wal_prohibit_state() does this:

+       /* Server is already in requested state */
+       if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
+               PG_RETURN_VOID();

But then later does this:

+ next_state_gen = cur_state_gen + 1;

Which is exactly the same thing as what you computed above using
WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly
sure how to structure this to make it as simple as possible, but I
don't think this is it.

Honestly this whole logic here seems correct but a bit hard to follow.
Like, maybe:

wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
switch (GetWALProhibitState(wal_prohibit_counter))
{
case WALPROHIBIT_STATE_READ_WRITE:
if (!walprohibit) return;
increment = true;
break;
case WALPROHIBIT_STATE_GOING_READ_WRITE:
if (walprohibit) ereport(ERROR, ...);
break;
...
}

And then just:

if (increment)
wal_prohibit_counter =
pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
target_counter_value = wal_prohibit_counter + 1;
// random stuff
// eventually wait until the counter reaches >= target_counter_value

This might not be exactly the right idea though. I'm just looking for
a way to make it clearer, because I find it a bit hard to understand
right now. Maybe you or someone else will have a better idea.
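
For the "eventually wait" step, a condition-variable loop of roughly this shape
seems like the natural fit -- just a sketch, where the condition variable
(state_change_cv) and the wait event name are placeholders, not anything the
current patch necessarily defines:

/* ... after conditionally advancing the counter as above ... */
ConditionVariablePrepareToSleep(&WALProhibitState->state_change_cv);
while (pg_atomic_read_u32(&WALProhibitState->shared_counter) < target_counter_value)
    ConditionVariableSleep(&WALProhibitState->state_change_cv,
                           WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
ConditionVariableCancelSleep();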

+               success =
+                   pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+                                                  &cur_state_gen, next_state_gen);
+               Assert(success);

I am almost positive that this is not OK. I think on some platforms
atomics just randomly fail some percentage of the time. You always
need a retry loop. Anyway, what happens if two people enter this
function at the same time and both read the same starting counter
value before either does anything?
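
Just to illustrate the shape I have in mind -- a sketch only, reusing the names
from the quoted code plus a hypothetical starting_state local that holds the
state we decoded before deciding to advance:

uint32 expected = pg_atomic_read_u32(&WALProhibitState->shared_state_generation);

for (;;)
{
    /* Re-validate the decoded state on every iteration. */
    if (GetWALProhibitState(expected) != starting_state)
        ereport(ERROR, ...);    /* somebody else changed it first */

    /*
     * On success we own the transition; on failure "expected" has been
     * overwritten with the value somebody else installed, so loop and
     * re-check.
     */
    if (pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
                                       &expected, expected + 1))
        break;
}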

+               /* To be sure that any later reads of memory happen
strictly after this. */
+               pg_memory_barrier();

You don't need a memory barrier after use of an atomic. The atomic
includes a barrier.

--
Robert Haas
EDB: http://www.enterprisedb.com

#83Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#82)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Jan 26, 2021 at 2:38 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul@gmail.com> wrote:

It is not like that, let me explain. When a user backend requests to alter WAL
prohibit state by using ASRO/ASRW DDL with the previous patch or calling
pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
set to the transition state i.e. going-read-only or going-read-write if it is
not already. If another backend trying to request the same alteration to the
wal prohibit state then nothing going to be changed in shared memory but that
backend needs to wait until the transition to the final wal prohibited state
completes. If a backend tries to request for the opposite state than the
previous which is in progress then it will see an error as "system state
transition to read only/write is already in progress". At a time only one
transition state can be set.

Hrm. Well, then that needs to be abundantly clear in the relevant comments.

Now, I am a little bit concerned about the current function name. How about
pg_set_wal_prohibit_state(bool) name or have two functions as
pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
other suggestions?

How about pg_prohibit_wal(true|false)?

LGTM. Used this.

It seems odd that ProcessWALProhibitStateChangeRequest() returns
without doing anything if !AmCheckpointerProcess(), rather than having
that be an Assert(). Why is it like that?

Like AbsorbSyncRequests().

Well, that can be called not from the checkpointer, according to the
comments. Specifically from the postmaster, I guess. Again, comments
please.

Done.

If you think that there will be panic when UpdateFullPageWrites() and/or
XLogReportParameters() tries to write WAL since the shared memory state for WAL
prohibited is set then it is not like that. For those functions, WAL write is
explicitly enabled by calling LocalSetXLogInsertAllowed().

I was under the impression that there won't be any problem if we allow the
writing WAL to UpdateFullPageWrites() and XLogReportParameters(). It can be
considered as an exception since it is fine that this WAL record is not streamed
to standby while graceful failover, I may be wrong though.

I don't think that's OK. I mean, the purpose of the feature is to
prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
the principle of least surprise.

Yes, you are correct.

I am still working on this. What worries me here is the sequence of WAL records
written by the startup process -- UpdateFullPageWrites() generates a record
just before the end-of-recovery checkpoint record, and XLogReportParameters()
generates one just after it, but before any other backend could write any WAL
record. We might also need to follow the same sequence while changing the
system back to read-write.

But in our case maintaining this sequence seems a little difficult. Let me
explain: when a backend executes a function (i.e. pg_prohibit_wal(false)) to
make the system read-write, that state change will be conveyed by the
checkpointer process to all existing backends using the global barrier, and
only then might the checkpointer want to write those records. While that is in
progress, a few existing backends that have already absorbed the barrier can
write new records, which might end up before the aforesaid WAL record sequence.
We might also think that we could write these records before emitting the super
barrier, but that does not solve the problem either, because a new backend
could connect to the server just after the read-write state change request was
made but before the checkpointer could pick it up. Such a backend could write
WAL before the checkpointer could (see IsWALProhibited()).

Apart from this, I also had a thought about the recovery_end_command execution
that happens just after the end-of-recovery checkpoint in the startup process.
First of all, why should we go and execute this command if we are read-only?
I don't think there is much use in booting up a read-only server as a standby,
which is itself read-only to some extent. Also, since pg_basebackup from a
read-only server is not allowed, a new standby cannot be set up from it. IMHO,
we should simply error out if someone tries to boot up a read-only server as a
standby using a standby.signal file. Thoughts?

I am not clear on this part. In the attached version I am sending SIGUSR1
instead of SIGINT, which works for me.

OK.

The attached version does not address all your comments, I'll continue my work
on that.

Some thoughts on this version:

+/* Extract last two bits */
+#define        WALPROHIBIT_CURRENT_STATE(stateGeneration)      \
+       ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define        WALPROHIBIT_NEXT_STATE(stateGeneration) \
+       WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

This is really confusing. First, the comment looks like it applies to
both based on how it is positioned, but that's clearly not true.
Second, the naming is really hard to understand. Third, there don't
seem to be comments explaining the theory of what is going on here.
Fourth, stateGeneration refers not to which generation of state we've
got here but to the combination of the state and the generation.
However, it's not clear that we ever really use the generation for
anything.

I think that the direction you went with this is somewhat different
from what I had in mind. That may be OK, but let me just explain the
difference. We both had in mind the idea that the low two bits of the
state would represent the current state and the upper bits would
represent the state generation. However, I wasn't necessarily
imagining that the only supported operation was making the combined
value go up by 1. For instance, I had thought that perhaps the effect
of trying to go read-only when we're in the middle of going read-write
would be to cancel the previous operation and start the new one. What
you have instead is that it errors out. So in your model a change
always has to finish before the next one can start, which in turn
means that the sequence is completely linear. In my idea the
state+generation might go from say 1 to 7, because trying to go
read-write would cancel the previous attempt to go read-only and
replace it with an attempt to go the other direction, and from 7 we
might go to 9 if somebody now tries to go read-only again before
that finishes. In your model, there's never any sort of cancellation
of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.

Yes, that made the implementation quite simple. I was under the impression that
we would not have so much concurrency that many backends would be trying to
change the system state so quickly.

One disadvantage of the way you've got it from a user perspective is
that if I'm writing a tool, I might get an error telling me that the
state change I'm trying to make is already in progress, and then I
have to retry. With the other design, I might attempt a state change
and have it fail because the change can't be completed, but I won't
ever fail because I attempt a state change and it can't be started
because we're in the wrong starting state. So, with this design, as
the tool author, I may not be able to just say, well, I tried to
change the state and it didn't work, so report the error to the user.
I think with the other approach that would be more viable. But I might
be wrong here; it would be interesting to hear what other people
think.

Thinking about it a little more, I agree that your approach is more viable, as
it can cancel a previously in-progress state change.

For example, in a graceful failover scenario, the master might detect that it
has lost the connection to all standbys and immediately call the function to
change the system state to read-only. But if it regains the connection soon
after and wants to go back to read-write, it would need to wait until the
previous state change completes. That could be quite bad if the system is very
busy and/or some backend is stuck or too busy and cannot absorb the barrier.

If you want, I will change it to work the way you had in mind in the next
version.

I dislike the use of the term state_gen or StateGen to refer to the
combination of a state and a generation. That seems unintuitive. I'm
tempted to propose that we just call it a counter, and, assuming we
stick with the design as you now have it, explain it with a comment
like this in walprohibit.h:

"There are four possible states. A brand new database cluster is
always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to
make it read only, then we enter the state
WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we
enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently
tries to make it read write, we will enter the state
WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete,
we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state
transitions are the only ones possible; for example, if we're
currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go
read-write will produce an error, and a second attempt to go read-only
will not cause a state change. Thus, we can represent the state as a
shared-memory counter whose value only ever changes by adding 1.
The initial value at postmaster startup is either 0 or 2, depending on
whether the control file specifies that the system is starting
read-only or read-write."

Thanks, added the same.

And then maybe change all the state_gen references to reference
wal_prohibit_counter or, where a shorter name is appropriate, counter.

Done.

I think this might be clearer if we used different data types for the
state and the state/generation combination, with functions to convert
between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0
etc. maybe do:

typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;

And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:

static inline WALProhibitState
GetWALProhibitState(uint32 wal_prohibit_counter)
{
return (WALProhibitState) (wal_prohibit_counter & 3);
}

Done.

I don't really know why we need WALPROHIBIT_NEXT_STATE at all,
honestly. It's just a macro to add 1 to an integer. And you don't even
use it consistently. Like pg_alter_wal_prohibit_state() does this:

+       /* Server is already in requested state */
+       if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
+               PG_RETURN_VOID();

But then later does this:

+ next_state_gen = cur_state_gen + 1;

Which is exactly the same thing as what you computed above using
WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly
sure how to structure this to make it as simple as possible, but I
don't think this is it.

Honestly this whole logic here seems correct but a bit hard to follow.
Like, maybe:

wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
switch (GetWALProhibitState(wal_prohibit_counter))
{
case WALPROHIBIT_STATE_READ_WRITE:
if (!walprohibit) return;
increment = true;
break;
case WALPROHIBIT_STATE_GOING_READ_WRITE:
if (walprohibit) ereport(ERROR, ...);
break;
...
}

And then just:

if (increment)
wal_prohibit_counter =
pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
target_counter_value = wal_prohibit_counter + 1;
// random stuff
// eventually wait until the counter reaches >= target_counter_value

This might not be exactly the right idea though. I'm just looking for
a way to make it clearer, because I find it a bit hard to understand
right now. Maybe you or someone else will have a better idea.

Yeah, this makes code much cleaner than before, did the same in the attached
version. Thanks again.

+               success =
+                   pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+                                                  &cur_state_gen, next_state_gen);
+               Assert(success);

I am almost positive that this is not OK. I think on some platforms
atomics just randomly fail some percentage of the time. You always
need a retry loop. Anyway, what happens if two people enter this
function at the same time and both read the same starting counter
value before either does anything?

+               /* To be sure that any later reads of memory happen
strictly after this. */
+               pg_memory_barrier();

You don't need a memory barrier after use of an atomic. The atomic
includes a barrier.

Understood, removed.

Regards,
Amul

Attachments:

v14-0001-WIP-Implement-wal-prohibit-state-using-global-ba.patch (application/x-patch)
From cadca1be89e56476238ff28bed2176025f8be137 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v14 1/3] WIP - Implement wal prohibit state using global
 barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the state counter in shared memory
    is advanced to the in-progress state (if it is not already) and the
    checkpointer process is signaled.  The checkpointer, noticing the
    in-progress state transition, emits the barrier request and acknowledges
    back to the backend that requested the state change once the transition
    has been completed.  The final state is updated in the control file to
    make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier while it is already in
    a transaction that has been assigned an XID, the backend is killed by
    throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID, we
    don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up; e.g. a
    backend might later request the checkpointer to put the system back to
    read-write.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint and
    xlog rotation.  Starting up again will perform crash recovery (XXX: needs
    some discussion on this as well), but the end-of-recovery checkpoint will
    be skipped and performed later, when the system is changed to
    WAL-Permitted mode.

 7. Altering the WAL-Prohibited mode is restricted on a standby server.

 8. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.

=====
TODO:
=====
 1. To be confirmed: is skipping the recovery checkpoint in StartupXLog()
    correct, and will the code right afterwards work with that?  More thought
    is needed on Robert's comment on the StartupXLog() changes[1]

====
REF:
====
 1. http://postgr.es/m/CA+TgmoZf2by_6kbY0JntGEmrSkj3DxUTNZDXORNasGdsSmkJjA@mail.gmail.com
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 397 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        |  99 +++++-
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 ++
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 728 insertions(+), 60 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..b48cf71e0c0
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,397 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter.  The last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning to the WAL prohibit state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();		/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();		/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state yet, since we have yet to convey this WAL
+	 * prohibit state to all backends.  Signal the checkpointer to do that and
+	 * to update the shared memory WAL prohibit state counter.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("The checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();	/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in a WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, this must be a
+	 * standalone (single-user) backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	/*
+	 * A WAL prohibit state change has been initiated.  We need to complete the
+	 * state transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush() is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared WAL
+	 * prohibit state counter should not have changed in the meantime.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+	{
+		/*
+		 * Request checkpoint if the end-of-recovery checkpoint has been skipped
+		 * previously.
+		 */
+		if (RecoveryCheckpointIsSkipped())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+			SetRecoveryCheckpointSkippedFlag(false);
+		}
+		ereport(LOG, (errmsg("system is now read write")));
+	}
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * The checkpointer completes any pending WAL prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	WALProhibitState cur_state;
+
+	/*
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * process all pending WAL prohibit state change requests as soon as
+	 * possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes run
+	 * in non-checkpointer processes, do nothing if this is not the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	while (cur_state != WALPROHIBIT_STATE_READ_WRITE)
+	{
+		if (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+			cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE)
+		{
+			CompleteWALProhibitChange();
+		}
+		else if (cur_state == WALPROHIBIT_STATE_READ_ONLY)
+		{
+			int			rc;
+
+			/*
+			 * Don't let the checkpointer process do anything until someone
+			 * wakes it up.  For example, a backend might later request us to
+			 * put the system back into the read-write state.
+			 */
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_WALPROHIBIT_STATE);
+
+			/*
+			 * If the postmaster dies or a shutdown request is received, just
+			 * bail out.
+			 */
+			if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+				return;
+		}
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd45..0de63af6365 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 236a66f6387..f99601b7a78 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -730,6 +732,13 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * lastRecoveryCheckpointSkipped indicates if the last recovery checkpoint
+	 * lastRecoveryCheckpointSkipped indicates whether the last recovery
+	 * checkpoint was skipped.  Lock protection is not needed since it isn't
+	 * going to be read and/or updated concurrently.
+	bool		lastRecoveryCheckpointSkipped;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6216,6 +6225,25 @@ SetCurrentChunkStartTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Set or unset the flag indicating that the last checkpoint has been skipped.
+ */
+void
+SetRecoveryCheckpointSkippedFlag(bool ChkptSkip)
+{
+	XLogCtl->lastRecoveryCheckpointSkipped = ChkptSkip;
+}
+
+/*
+ * Return value of lastRecoveryCheckpointSkipped flag.
+ */
+bool
+RecoveryCheckpointIsSkipped(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->lastRecoveryCheckpointSkipped;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  * Startup process maintains an accurate local copy in XLogReceiptTime
@@ -7784,6 +7812,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which will decide whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -7794,7 +7828,17 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
+	 */
+	if (ControlFile->wal_prohibited && InRecovery)
+	{
+		SetRecoveryCheckpointSkippedFlag(true);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else if (InRecovery)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7987,6 +8031,7 @@ StartupXLOG(void)
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8037,6 +8082,16 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool wal_prohibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = wal_prohibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8252,9 +8307,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8273,9 +8328,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8297,6 +8363,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8586,9 +8658,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is performed during recovery; otherwise, the shutdown
+	 * checkpoint and xlog rotation are performed only if WAL writing is
+	 * permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8601,6 +8677,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d78..da154254a4d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1525,6 +1525,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 47e60ca5613..9498a8ff69a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf611..033f8a7bdd9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for a WAL prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719dd..63d52825497 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4307,6 +4307,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c43cdd685b4..31383a11d08 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -604,24 +604,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index fe143151cc5..1c7b40563b5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold off WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for a WAL prohibit state change request (checkpointer only) */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold off WAL prohibit state change requests when
+		 * there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here as well,
+				 * for the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1d81071c357..a5f8ced59e4 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118ed..8fb43cc55ca 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2048,6 +2050,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12218,4 +12232,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean GUC.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* The calling code is never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * Unlike the assertions above, a transaction that doesn't have a valid XID
+ * (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to error out explicitly before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..9857ab05c43 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,8 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern void SetRecoveryCheckpointSkippedFlag(bool ChkptSkip);
+extern bool RecoveryCheckpointIsSkipped(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +328,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL insertion is prohibited */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a3..f3fc971b247 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11367,6 +11367,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87e..8f4fc4f1e15 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 721b230bf29..daa782f24c9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2697,6 +2697,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
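
As an aside, the four-state counter scheme documented in walprohibit.h above
can be illustrated with a tiny standalone program.  This is only an editor's
sketch of the counter arithmetic (the state is counter & 3, and the counter
only ever grows by 1); state_name() and counter are made-up names for
illustration, not the patch's shared-memory code.

#include <stdint.h>
#include <stdio.h>

/*
 * Same encoding as WALProhibitState in walprohibit.h: the state is the low
 * two bits of a monotonically increasing counter.
 */
static const char *
state_name(uint32_t counter)
{
	switch (counter & 3)
	{
		case 0:
			return "read write";
		case 1:
			return "going read only";
		case 2:
			return "read only";
		default:
			return "going read write";
	}
}

int
main(void)
{
	uint32_t	counter = 0;	/* control file said: starting read write */

	printf("%u: %s\n", (unsigned) counter, state_name(counter));

	/* ALTER SYSTEM READ ONLY: one increment on request, one on completion */
	counter++;
	printf("%u: %s\n", (unsigned) counter, state_name(counter));
	counter++;
	printf("%u: %s\n", (unsigned) counter, state_name(counter));

	/* ALTER SYSTEM READ WRITE: two more increments, back to read write */
	counter += 2;
	printf("%u: %s\n", (unsigned) counter, state_name(counter));

	return 0;
}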

Attachment: v14-0003-WIP-Documentation.patch (application/x-patch)
From 4097ec411507334311b32a3217fa1ffd477276aa Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v14 3/3] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is read only when it is not currently possible to insert write-ahead
+log records, either because the system is still in recovery or because it has
+been forced into the WAL prohibited state by executing ALTER SYSTEM READ ONLY.
+We have a lower-level defense in XLogBeginInsert() and elsewhere that stops us
+from modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is
+called inside a critical section we must not depend on it to report an error,
+because any error there would escalate to PANIC, as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt must
+stop writing WAL immediately.  To absorb the barrier, a backend kills its
+running transaction if it has a valid XID, since a valid XID indicates that the
+transaction has performed, or is planning, a WAL write.  Transactions that have
+not acquired an XID yet, and operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that do not necessarily have a valid XID when writing WAL, are not
+stopped during barrier processing; they might instead hit an error from
+XLogBeginInsert() when they try to write WAL in the read only state.  To
+prevent such an error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that indicates
+whether permission has been checked before calling XLogBeginInsert().  If it
+has not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among these permission check routines
+are:
+
+	Places where a WAL write inside a critical section can happen without a
+	valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error is reported before the critical section is entered.
+
+	Places that handle INSERT or UPDATE, which never happen without a valid
+	XID, can be checked with AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying blocks
+because of hints in that case.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
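
To make the rule described in the README hunk above concrete, here is a minimal
sketch of how a WAL-writing code path is expected to look once 0002 is applied:
the permission check comes before START_CRIT_SECTION(), and only then does
XLogBeginInsert() run inside the critical section.  This is an editor's
illustration, not code from the patches: sketch_modify_page is a made-up name,
the buffer handling is elided, and the rmgr id and info byte are supplied by a
hypothetical caller.

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch only: modify one already-pinned-and-locked buffer of "rel" and WAL
 * log it with the caller-provided record type.
 */
static void
sketch_modify_page(Relation rel, Buffer buffer, RmgrId rmid, uint8 info)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * This path is assumed reachable without a valid XID (as in VACUUM), so
	 * use the ERROR variant; XID-bearing DML paths would instead call
	 * AssertWALPermittedHaveXID() here.
	 */
	if (needwal)
		CheckWALPermitted();

	/* No ereport(ERROR) from here until the changes are logged */
	START_CRIT_SECTION();

	/* ... apply the change to the shared buffer ... */
	MarkBufferDirty(buffer);

	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();
		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
		recptr = XLogInsert(rmid, info);
		PageSetLSN(BufferGetPage(buffer), recptr);
	}

	END_CRIT_SECTION();
}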

Attachment: v14-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 88ca146fe3affc303df6be21e0bd4f024f92f515 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v14 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Added an Assert or an ERROR, based on the following criteria, for when the
system is WAL prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, e.g.
   in VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common static
   inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that, added
   AssertWALPermitted_HaveXID().

To enforce the rule that one of the aforesaid assert or error checks runs
before entering a critical section for a WAL write, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
fails an assertion when it is called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() to perform that check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 26 ++++++++++++-----
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 459 insertions(+), 69 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff0..9e195e4bc6f 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e127639..e0c483171ed 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -580,6 +585,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -634,6 +640,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -645,7 +654,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9926e2bd546..b9b80b2b074 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1957,6 +1958,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2280,6 +2283,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2841,6 +2846,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3593,6 +3600,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3766,6 +3775,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4699,6 +4710,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5490,6 +5503,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5648,6 +5663,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5756,6 +5773,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5876,6 +5895,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5906,6 +5926,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5916,7 +5940,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index e3a716a2a2f..e93c211da4f 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -232,6 +233,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -286,6 +288,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -319,7 +325,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad7..7370dc1f64d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1481,6 +1486,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1498,7 +1506,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never has a WAL prohibit state, so skip
+	 * the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3336039125..c840912d116 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c28..9a9e2fc2b3c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -185,6 +186,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -208,6 +210,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -220,7 +226,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -338,6 +344,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -383,6 +390,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -401,7 +412,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1148,6 +1159,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1279,6 +1293,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2065,6 +2081,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2153,6 +2170,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2204,7 +2225,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2297,6 +2318,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2511,6 +2533,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2590,7 +2616,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222e..d689473a713 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7dcfa023236..bc0dcddb9b2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b778324..91f2b18e367 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1106,6 +1107,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2196,6 +2199,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2286,6 +2292,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2264c2c849c..15cb9b9a25c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index b48cf71e0c0..e85f0fe8ccb 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission must be checked
+ * before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0de63af6365..cdb18c47b0c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f99601b7a78..5cc81c3fe94 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1035,7 +1035,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2872,9 +2872,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8931,6 +8933,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9182,6 +9186,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9339,6 +9346,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9993,7 +10002,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10007,10 +10016,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10032,8 +10041,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 033f8a7bdd9..e4ee43e4a41 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 20e50247ea4..693601de238 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

#84Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#83)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jan 28, 2021 at 7:17 AM Amul Sul <sulamul@gmail.com> wrote:

I am still on this. What worries me here is the sequence of WAL records
written in the startup process -- UpdateFullPageWrites() generates a record
just before the recovery checkpoint record, and XLogReportParameters() just
after it but before any other backend can write any WAL record. We might
also need to follow the same sequence while changing the system to read-write.

I was able to chat with Andres about this topic for a while today and
he made some proposals which seemed pretty good to me. I can't promise
that what I'm about to write is an entirely faithful representation of
what he said, but hopefully it's not so far off that he gets mad at me
or something.

1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with.

2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments.

3. If the system is starting up read only (and the read-only state
didn't get cleared because of #1 above) then don't call
XLogAcceptWrites() at the end of StartupXLOG() and instead have the
checkpointer do it later when the system is going read-write for the
first time.
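
To make that concrete, here is a minimal sketch of what XLogAcceptWrites()
could look like, assembled from steps (1)-(5) above. The parameter names and
the branch conditions are illustrative assumptions only (and the sketch is
assumed to live in xlog.c, where these helpers are visible); the actual
refactoring would just move the existing StartupXLOG() code:

/*
 * Sketch only: group the write-requiring end-of-recovery steps so they can
 * be called from StartupXLOG() now, or from the checkpointer later.
 */
static void
XLogAcceptWrites(bool promoted, bool checkpointer_running)
{
	/* (1) Emit XLOG_FPW_CHANGE if full_page_writes changed during recovery. */
	UpdateFullPageWrites();

	/* (2) End-of-recovery record, or an end-of-recovery checkpoint. */
	if (promoted && checkpointer_running)
		CreateEndOfRecoveryRecord();
	else if (checkpointer_running)
		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
						  CHECKPOINT_IMMEDIATE |
						  CHECKPOINT_WAIT);
	else
		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);

	/* (3) recovery_end_command would run here; omitted from this sketch. */

	/* (4) WAL-log any parameter changes that need to be recorded. */
	XLogReportParameters();

	/* (5) Finish commit timestamp initialization now that WAL is writable. */
	CompleteCommitTsInitialization();
}

The only remaining question is then where to call it from: immediately at the
end of StartupXLOG() in the normal case, or later from the checkpointer per
point 3.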

--
Robert Haas
EDB: http://www.enterprisedb.com

#85Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#84)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On 2021-02-16 17:11:06 -0500, Robert Haas wrote:

I can't promise that what I'm about to write is an entirely faithful
representation of what he said, but hopefully it's not so far off that
he gets mad at me or something.

Seems accurate - and also I'm way too tired to be mad ;)

1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with.

It seems also to make sense from a behaviour POV to me: Imagine a
"smooth" planned failover with ASRO:
1) ASRO on primary
2) promote standby
3) edit primary config to include primary_conninfo, add standby.signal
4) restart "read only primary"

There's not really any spot in which it'd be useful to disable ASRO,
right? But 4) should make the node a normal standby.

Greetings,

Andres Freund

#86Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#85)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Feb 17, 2021 at 7:50 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-02-16 17:11:06 -0500, Robert Haas wrote:

Thank you very much to both of you !

I can't promise that what I'm about to write is an entirely faithful
representation of what he said, but hopefully it's not so far off that
he gets mad at me or something.

Seems accurate - and also I'm way too tired to be mad ;)

1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with.

It seems also to make sense from a behaviour POV to me: Imagine a
"smooth" planned failover with ASRO:
1) ASRO on primary
2) promote standby
3) edit primary config to include primary_conninfo, add standby.signal
4) restart "read only primary"

There's not really any spot in which it'd be useful to disable ASRO,
right? But 4) should make the node a normal standby.

Understood.

In the attached version I have made the changes according to what Robert
summarised in his previous mail[1].

In addition to that, I also moved the code that updates the control file into
XLogAcceptWrites(), which likewise gets skipped when the system is read-only
(WAL prohibited). The system then stays in crash recovery, and that changes
only once we perform the end-of-recovery checkpoint and the WAL writes that
were skipped at startup. The benefit of keeping the system in recovery mode is
that it addresses my concern[2] that other backends could connect and write
WAL records while we were changing the system to read-write. Now no other
backend is allowed to write WAL; UpdateFullPageWrites(), the end-of-recovery
checkpoint, and XLogReportParameters() are performed in the same sequence as
at startup while changing the system to read-write.
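
To illustrate the resulting control flow at the end of startup (a rough sketch
only, not text from the patch: the helper name and its arguments below are
made up for illustration, while IsWALProhibited() and XLogAcceptWrites() are
the pieces discussed above):

/*
 * Sketch only.  The point is the branch: skip the end-of-recovery WAL
 * writes when booting WAL-prohibited and leave them for the later
 * read-write transition.
 */
static void
MaybeAcceptWrites(bool promoted, bool checkpointer_running)
{
	if (!IsWALProhibited())
	{
		/* Normal startup: perform the end-of-recovery WAL writes now. */
		XLogAcceptWrites(promoted, checkpointer_running);
		return;
	}

	/*
	 * WAL-prohibited startup: stay formally "in crash recovery" and skip
	 * the control file update; the deferred XLogAcceptWrites() runs later,
	 * in the same sequence, once pg_prohibit_wal(false) puts the system
	 * back into read-write.
	 */
	ereport(LOG,
			(errmsg("skipping end-of-recovery WAL writes while system is read only")));
}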

Regards,
Amul

1] /messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
2] /messages/by-id/CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com

Attachments:

v15-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 791cbcc97a896f425155bc5d6126d1721f43a737 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v15 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the current state is marked as
    in progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the state transition, emits the barrier
    request and then acknowledges back to the backend that requested the
    state change once the transition has been completed.  The final state
    will be updated in the control file to make it persistent across
    system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in
    a transaction and the transaction has already been assigned an XID,
    then the backend will be killed by throwing FATAL (XXX: needs more
    discussion).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up, e.g. a
    backend might later request putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation. Starting up again will perform crash recovery (XXX:
    needs some discussion as well), but the end-of-recovery checkpoint, the
    necessary WAL writes, and the control file update needed to start the
    server normally will be skipped; they will be performed when the system
    is changed to WAL-Permitted mode. Until then the "Database cluster
    state" will be "in crash recovery".

 7. Altering WAL-Prohibited mode is restricted on a standby server, except in
    the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile will implicitly pull the server out of
    the read-only (WAL prohibited) state permanently.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 404 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 336 ++++++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 ++
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 894 insertions(+), 138 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..1de27529b69
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,404 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: We kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need some thought:
+		 *
+		 * 1. Because of challenges that the wire protocol presents, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes not allowed during recovery except the crash
+	 * recovery case.  In startup process, we skip the end-of-recovery
+	 * checkpoint, and related wal write operation while booting read only (wal
+	 * prohibited) server, which should be completed before changing the system
+	 * state to read write.  To disallow any other backend from writing a wal
+	 * record before the end of crash recovery checkpoint finishes, we let the
+	 * server in recovery mode.
+	 */
+	if (!StartupCrashRecoveryIsPending())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();		/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();		/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey this WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared memory WAL prohibit state counter.
+	 *
+	 * If the end-of-recovery checkpoint and the WAL writes required to start
+	 * the server normally were skipped previously, then do that now.
+	 */
+	if (StartupCrashRecoveryIsPending())
+	{
+		/* Should only be here while changing the system to read-write. */
+		Assert(!walprohibit);
+		PerformPendingStartupOperations();
+	}
+	else if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();	/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush() is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared WAL
+	 * prohibit state counter cannot have changed in the meantime.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* The counter should now correspond to a final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * The checkpointer completes a pending WAL prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	WALProhibitState cur_state;
+
+	/*
+	 * Only the checkpointer process should do this work, and it must process
+	 * pending WAL prohibit state change requests as soon as possible.  Since
+	 * CreateCheckPoint() and ProcessSyncRequests() sometimes run in
+	 * non-checkpointer processes, do nothing if we are not the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	while (cur_state != WALPROHIBIT_STATE_READ_WRITE)
+	{
+		if (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+			cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE)
+		{
+			CompleteWALProhibitChange();
+		}
+		else if (cur_state == WALPROHIBIT_STATE_READ_ONLY)
+		{
+			int			rc;
+
+			/*
+			 * Don't let the checkpointer do anything else until someone wakes
+			 * it up; for example, a backend might later ask us to put the
+			 * system back into the read-write state.
+			 */
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_WALPROHIBIT_STATE);
+
+			/*
+			 * If the postmaster dies or a shutdown request is received, just
+			 * bail out.
+			 */
+			if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+				return;
+		}
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 17fbc41bbb7..e37dbada4db 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e3b5df7dcb..0df545fc612 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -730,6 +732,14 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * startupCrashRecoveryPending indicates if the last recovery checkpoint and
+	 * startupCrashRecoveryPending indicates whether the end-of-recovery
+	 * checkpoint, and the WAL writes required to start the server normally,
+	 * have been skipped.  No lock protection is needed, since this flag is
+	 * never read or updated concurrently.
+	bool		startupCrashRecoveryPending;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -978,6 +988,13 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+							 bool localPromoteIsTriggered,
+							 XLogReaderState *xlogreader,
+							 bool archiveRecoveryRequested,
+							 TimeLineID endOfLogTLI, XLogRecPtr endOfLog,
+							 TimeLineID thisTimeLineID);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -6187,6 +6204,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of startupCrashRecoveryPending flag.
+ */
+bool
+StartupCrashRecoveryIsPending(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->startupCrashRecoveryPending;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6392,6 +6419,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6541,13 +6569,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery was requested, the system cannot be in a
+		 * read-only (WAL prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7785,16 +7822,130 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
 	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later, in XLogAcceptWrites().
+	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which determines whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in crash recovery: a shutdown in the WAL
+		 * prohibit state skips the shutdown checkpoint, which forces recovery
+		 * on restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->startupCrashRecoveryPending = true;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+	{
+		promoted = XLogAcceptWrites(needChkpt, bgwriterLaunched,
+									LocalPromoteIsTriggered, xlogreader,
+									ArchiveRecoveryRequested,
+									EndOfLogTLI, EndOfLog, ThisTimeLineID);
+	}
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take a longer
+	 * than is appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+/*
+ * This is the tail end of StartupXLOG(), performing the WAL writes that are
+ * necessary before the server can start running normally.  Only the startup
+ * process can call this function directly.
+ */
+static bool
+XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+				 bool localPromoteIsTriggered, XLogReaderState *xlogreader,
+				 bool archiveRecoveryRequested, TimeLineID endOfLogTLI,
+				 XLogRecPtr endOfLog, TimeLineID thisTimeLineID)
+{
+	bool		promoted = false;
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7812,15 +7963,17 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
-			if (LocalPromoteIsTriggered)
+			if (localPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7849,8 +8002,10 @@ StartupXLOG(void)
 			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
 	}
 
-	if (ArchiveRecoveryRequested)
+	if (archiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7868,7 +8023,7 @@ StartupXLOG(void)
 		 * pre-allocated files containing garbage. In any case, they are not
 		 * part of the new timeline's history so we don't need them.
 		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+		RemoveNonParentXlogFiles(endOfLog, thisTimeLineID);
 
 		/*
 		 * If the switch happened in the middle of a segment, what to do with
@@ -7899,14 +8054,14 @@ StartupXLOG(void)
 		 * restored from the archive to begin with, it's expected to have a
 		 * .done file).
 		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		if (XLogSegmentOffset(endOfLog, wal_segment_size) != 0 &&
 			XLogArchivingActive())
 		{
 			char		origfname[MAXFNAMELEN];
 			XLogSegNo	endLogSegNo;
 
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			XLByteToPrevSeg(endOfLog, endLogSegNo, wal_segment_size);
+			XLogFileName(origfname, endOfLogTLI, endLogSegNo, wal_segment_size);
 
 			if (!XLogArchiveIsReadyOrDone(origfname))
 			{
@@ -7914,7 +8069,7 @@ StartupXLOG(void)
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
 
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+				XLogFilePath(origpath, endOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
 				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
 
@@ -7930,63 +8085,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8020,20 +8125,51 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
+	return promoted;
+}
+
+/*
+ * This function should be called only if the WAL writes that the startup
+ * process would normally perform were skipped because the system was read
+ * only (WAL prohibited), and it should be called only once, while changing
+ * the system back to read-write.
+ */
+void
+PerformPendingStartupOperations(void)
+{
+	Assert(StartupCrashRecoveryIsPending());
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * When the end-of-recovery checkpoint is skipped, InRecovery is always
+	 * true; see the place in StartupXLOG() where the startupCrashRecoveryPending
+	 * flag gets set.  Since we have reached this point, the necessary auxiliary
+	 * processes must be running, so bgwriterLaunched is true as well.  The
+	 * end-of-recovery checkpoint is never skipped when ArchiveRecoveryRequested
+	 * is true, because in that case the system implicitly leaves the WAL
+	 * prohibit state and allows all WAL writes during startup; therefore
+	 * ArchiveRecoveryRequested must be false here, and the values of the
+	 * remaining parameters are not applicable.
 	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	(void) XLogAcceptWrites(true,				/* needChkpt */
+							true,				/* bgwriterLaunched */
+							false,				/* localPromoteIsTriggered */
+							NULL,				/* xlogreader */
+							false,				/* archiveRecoveryRequested */
+							0,					/* endOfLogTLI */
+							InvalidXLogRecPtr,	/* endOfLog */
+							0);					/* thisTimeLineID */
+
+	XLogCtl->startupCrashRecoveryPending = false;
+}
+
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
 }
 
 /*
@@ -8251,9 +8387,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8272,9 +8408,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8296,6 +8443,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8585,9 +8738,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Create a restartpoint if we are still in recovery; otherwise, perform
+	 * the shutdown checkpoint and xlog rotation only if WAL writing is
+	 * permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8600,6 +8757,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d78..da154254a4d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1525,6 +1525,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 8da5e5c9c39..0fb9748a527 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only (i.e., WAL
+		 * writes must be permitted).  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf611..033f8a7bdd9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719dd..63d52825497 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4307,6 +4307,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c43cdd685b4..31383a11d08 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -604,24 +604,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index fe143151cc5..1c7b40563b5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to stop wal prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked for and processed by the checkpointer as
+		 * soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up WAL prohibit state change requests for a
+		 * long time when there are many fsync requests to be processed; they
+		 * need to be checked for and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118ed..8fb43cc55ca 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2048,6 +2050,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12218,4 +12232,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f701..922cd9641d8 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile->backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
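
As an illustration only (this is not part of the patch), the counter encoding
described in the walprohibit.h comment above can be exercised with a tiny
standalone C program that mirrors the WALProhibitState enum and the "& 3"
mapping from GetWALProhibitState().  Each user-requested transition takes two
increments: one by the requesting backend (to the GOING_* state) and one by the
checkpointer (to the final state):

#include <stdio.h>
#include <stdint.h>

/* Mirrors the four states of the shared WAL prohibit counter. */
static const char *const state_names[] = {
	"READ_WRITE",			/* counter & 3 == 0 */
	"GOING_READ_ONLY",		/* counter & 3 == 1 */
	"READ_ONLY",			/* counter & 3 == 2 */
	"GOING_READ_WRITE"		/* counter & 3 == 3 */
};

int
main(void)
{
	uint32_t	counter;

	/* A read-write cluster starts at 0; a read-only one starts at 2. */
	for (counter = 0; counter <= 8; counter++)
		printf("counter %u -> %s\n", (unsigned) counter,
			   state_names[counter & 3]);
	return 0;
}

Since the counter only ever grows by one, any value whose low two bits are 0 or
2 is a stable state and 1 or 3 means a transition is in progress, which is why
a requesting backend only has to wait for the counter to advance by one more
increment (the checkpointer's).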
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..7bff0adc2cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,7 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool StartupCrashRecoveryIsPending(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +327,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1487710d590..62b8ac41702 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11379,6 +11379,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87e..8f4fc4f1e15 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3adb3b..cd89ff06790 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2698,6 +2698,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v15-0003-WIP-Documentation.patchapplication/x-patch; name=v15-0003-WIP-Documentation.patchDownload
From 3cac07047fc3e3d8fe3bd32a5e3bc64fbd39bdc2 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v15 3/3] WIP - Documentation.

TODOs:

1] Documentation regarding ALTER SYSTEM READ/WRITE
---
 src/backend/access/transam/README | 60 ++++++++++++++++++++++++++++---
 src/backend/storage/page/README   | 12 +++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because it is still in recovery or because WAL
+writes have been prohibited by executing ALTER SYSTEM READ ONLY.  We have
+lower-level defenses in XLogBeginInsert() and elsewhere to stop us from
+modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is called
+inside a critical section we must not depend on it to report the error, since
+any error raised there is escalated to a PANIC, as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+must stop writing WAL immediately.  While absorbing the barrier, a backend
+kills its running transaction if it has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning, WAL writes.
+Transactions that have not yet acquired an XID, and operations such as VACUUM
+or CREATE INDEX CONCURRENTLY that do not necessarily have a valid XID when
+writing WAL, are not interrupted by barrier processing; they may instead hit
+an error from XLogBeginInsert() when they try to write WAL in the read-only
+state.  To prevent such an error from being raised by XLogBeginInsert() inside
+a critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, an assertion-only flag records whether permission was
+checked before calling XLogBeginInsert().  If it was not, XLogBeginInsert()
+fails an assertion.  The permission check is not mandatory when
+XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To get the permission-checked flag set, call
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among these routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	a valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places that handle INSERT and UPDATE, which never happen without a valid
+	XID, can use AssertWALPermittedHaveXID(), so that non-assert builds do
+	not pay the cost of the check.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, should use AssertWALPermitted() so that
+	assert-enabled builds verify that permission was checked.
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints in that case.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set at such times must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v15-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchapplication/x-patch; name=v15-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchDownload
From a043c0458fb75292340d3ddce3198f420728772e Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v15 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, an Assert or an ERROR is added for the case
where the system is WAL prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, the common
   static inline function CheckWALPermitted() is added.
 - Added an Assert for functions that cannot be reached without a valid XID;
   the assertion also verifies that the XID is valid.  For that,
   AssertWALPermittedHaveXID() is added.

To enforce the rule that one of the aforesaid assert or error checks precedes a
critical section that writes WAL, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
fails an assertion when it is called inside a critical section.

If the WAL insert is not inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() itself to perform the check and
report an error.
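
For illustration only, here is a minimal sketch of the coding pattern this rule
produces.  foo_update_page(), RM_FOO_ID, and XLOG_FOO_UPDATE are hypothetical
placeholders; the permission check and the xlog calls are the ones used
throughout the diff below:

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

void
foo_update_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Check permission before the critical section; this may ERROR out. */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... apply the change to the shared buffer ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();		/* asserts that permission was checked */
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_UPDATE);	/* hypothetical */
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();
}

This mirrors what the hunks below do for BRIN, GIN, hash, heap, and the other
access methods.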
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 26 ++++++++++++-----
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 459 insertions(+), 69 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a0453b36cde..5c71f94742f 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -650,6 +656,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ddecb8ab18e..3f7b600c44c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -260,6 +261,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -341,6 +343,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -348,7 +353,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -577,6 +582,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -631,6 +637,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -642,7 +651,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9926e2bd546..b9b80b2b074 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1957,6 +1958,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2280,6 +2283,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2841,6 +2846,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3593,6 +3600,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3766,6 +3775,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4699,6 +4710,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5490,6 +5503,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5648,6 +5663,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5756,6 +5773,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5876,6 +5895,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5906,6 +5926,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5916,7 +5940,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0bb78162f54..1bafa3edde5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1485,6 +1490,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1502,7 +1510,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never runs in the WAL prohibit state,
+	 * so skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3336039125..c840912d116 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8c326a4774c..1c4acb82f04 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -184,6 +185,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -207,6 +209,10 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -219,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -337,6 +343,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -382,6 +389,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -400,7 +411,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1147,6 +1158,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1277,6 +1291,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2068,6 +2084,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2156,6 +2173,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2207,7 +2228,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2300,6 +2321,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	int			targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber nextchild;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2521,6 +2543,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2600,7 +2626,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222e..d689473a713 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7dcfa023236..bc0dcddb9b2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 70d22577cee..05666fe15a2 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1113,6 +1114,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2203,6 +2206,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2293,6 +2299,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2264c2c849c..15cb9b9a25c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 1de27529b69..be4974002ad 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission must be checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e37dbada4db..1d2ea469bce 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We reach here only with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0df545fc612..2ce0d5e20e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1043,7 +1043,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2880,9 +2880,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state should not restrict WAL flushing; otherwise, a dirty
+	 * buffer could not be evicted until WAL is flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9011,6 +9013,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9262,6 +9266,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9419,6 +9426,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10073,7 +10082,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10087,10 +10096,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10112,8 +10121,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while the
+		 * system is in WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 033f8a7bdd9..e4ee43e4a41 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag once we are no longer in a critical
+ * section; otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

#87Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#86)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is the rebased version for the latest master head (b3a9e9897ec).

Regards,
Amul

Attachments:

v16-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/octet-stream)
From 095b23cd5d152f6adc1c042f762d06a0af105061 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v16 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

The Assert or the ERROR is added, when the system is WAL-prohibited, based on
the following criteria:

 - Added an ERROR for functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, the common static
   inline function CheckWALPermitted() is added.
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that,
   AssertWALPermitted_HaveXID() is added.

To enforce the rule that one of these checks precedes any critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when called
inside a critical section.

If the WAL insert is not inside a critical section, this check is not
required; we can rely on XLogBeginInsert() to perform the check and report an
error.
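
For illustration, a typical call site then follows the pattern sketched
below.  This is only a minimal sketch of the rule described above; "rel" and
"buf" are placeholders for whatever relation and buffer the caller already
holds:

    bool        needwal = RelationNeedsWAL(rel);

    /* Check WAL permission before entering the critical section */
    if (needwal)
        CheckWALPermitted();

    START_CRIT_SECTION();

    MarkBufferDirty(buf);

    /* Safe to write WAL here: permission was checked above */
    if (needwal)
        log_newpage_buffer(buf, false);

    END_CRIT_SECTION();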
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 26 ++++++++++++-----
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 459 insertions(+), 69 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9c1d590dc71..82c1bd1b6c6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1957,6 +1958,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2280,6 +2283,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2840,6 +2845,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3591,6 +3598,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3764,6 +3773,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4696,6 +4707,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5486,6 +5499,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5644,6 +5659,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5752,6 +5769,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -5872,6 +5891,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -5902,6 +5922,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5912,7 +5936,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e66..5158e3ee24e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1485,6 +1490,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1502,7 +1510,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never has a WAL prohibit state, so skip
+	 * the permission check when we reach here from the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 1edb9f95797..6ef57e8384c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 629a23628ef..97b216a2553 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -195,6 +196,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	BTMetaPageData *metad;
 	bool		rewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -232,6 +234,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -244,7 +250,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -363,6 +369,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -408,6 +415,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -426,7 +437,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1145,6 +1156,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1275,6 +1289,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2059,6 +2075,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2147,6 +2164,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2198,7 +2219,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2285,6 +2306,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2512,6 +2534,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2588,7 +2614,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6cc..ac0c6af49e0 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2195,6 +2198,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2285,6 +2291,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 1de27529b69..be4974002ad 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag enforcing the rule that WAL insert permission is checked
+ * before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8f81974ac84..f9cc3dcca11 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fa293ebf600..68b2bfa4dab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1043,7 +1043,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2874,9 +2874,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8997,6 +8999,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!XLogInsertAllowed() && !RecoveryInProgress())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9248,6 +9252,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9405,6 +9412,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10057,7 +10066,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10071,10 +10080,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10096,8 +10105,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* WAL permission has already been checked, so just assert it. */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error raised here would escalate
+	 * to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in the
+		 * WAL prohibited state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 170de288c2c..bb4cb53dcb6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed in recovery but not in the WAL prohibited state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a more general one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section;
+ * otherwise, mark it checked-and-used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
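
To illustrate the coding rule the hunks above apply, here is a minimal sketch
outside the diff.  It is not part of the patch; the relation, buffer, payload,
and the use of RM_XLOG_ID/XLOG_NOOP as the record type are made-up stand-ins,
only the check-before-critical-section pattern is the point:

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "catalog/pg_control.h"		/* XLOG_NOOP, used only as a placeholder */
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Hypothetical caller; shows where the permission check must go. */
static void
example_log_change(Relation rel, Buffer buf, char *payload, int len)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Check WAL permission before entering the critical section. */
	if (needwal)
		CheckWALPermitted();

	/* No ereport(ERROR) until changes are logged */
	START_CRIT_SECTION();

	/* ... apply the change to the shared buffer here ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();		/* asserts that the check above was done */
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		XLogRegisterData(payload, len);

		/* Record type is a placeholder; a real caller uses its own rmgr. */
		recptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();			/* resets the walpermit-checked state */
}

On paths that are guaranteed to hold a valid XID, as in the DML hunks above,
AssertWALPermittedHaveXID() takes the place of CheckWALPermitted(), so
non-assert builds skip the check entirely.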

v16-0003-WIP-Documentation.patch
From 94e8593fa87cb169e8f7bfd50db294b5e859ef3e Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v16 3/3] WIP - Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 33 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 08f08322ca5..1be887c4ce2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24696,9 +24696,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24784,6 +24784,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument and alters the WAL read-write state of the
+        system, forcing all processes of the
+        <productname>PostgreSQL</productname> server to adopt the new state
+        immediately. When <literal>true</literal> is passed, the system is
+        changed to the read-only (WAL prohibited) state, if it is not in that
+        state already. When <literal>false</literal> is passed, the system is
+        changed to the read-write (WAL permitted) state, if it is not in that
+        state already. See <xref linkend="wal-prohibited-state"/> for more
+        details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..c90176c6167 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,37 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any user with the required
+    permission can call the <function>pg_prohibit_wal</function> function to
+    force the system into a read-only mode in which inserting write-ahead log
+    is prohibited until the same function is executed again to change the
+    state back to read-write. As in Hot Standby, connections to the server can
+    still run read-only queries while in the WAL prohibited state.  While the
+    system only allows read-only queries, the
+    <literal>system_is_read_only</literal> GUC reports <literal>on</literal>;
+    otherwise it reports <literal>off</literal>.  When a user requests the WAL
+    prohibited state, any existing session whose transaction has already
+    performed, or is expected to perform, WAL writes is terminated.  This is
+    useful in HA setups where the master server needs to stop accepting WAL
+    writes immediately and kick out any transaction expecting to write WAL at
+    commit, for example when the master loses network connectivity or its
+    replication connections fail.
+
+   <para>
+    Shutting down the read-only system will skip the shutdown checkpoint, and at
+    the restart, it will go into crash recovery mode and stay in that state
+    until the system changed to read-write. At starting read-only server if it
+    finds <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file then system implicitly get out of
+    read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is read only when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by executing ALTER
+SYSTEM READ ONLY.  We have a lower-level defense in XLogBeginInsert() and
+elsewhere to stop us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, since any error raised there escalates to a PANIC as
+mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+needs to stop writing WAL immediately.  While absorbing the barrier, a backend
+kills its running transaction if that transaction has a valid XID, since a
+valid XID indicates that the transaction has performed, or plans to perform,
+WAL writes.  Transactions that have not yet acquired a valid XID, and
+operations such as VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily
+have a valid XID when they write WAL, are not stopped during barrier
+processing; they might instead hit the error from XLogBeginInsert() when they
+try to write WAL in the read-only state.  To keep that error from being raised
+inside a critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, an assert-only flag records whether permission has
+been checked before XLogBeginInsert() is called.  If it has not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the permission-check flag, call either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exiting the
+critical section.  The rules for choosing among the permission check routines
+are:
+
+	Places where a WAL write inside a critical section can happen without a
+	valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before the critical section is entered.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can use AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where assert-enabled builds should still
+	verify that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only
+(i.e. during recovery or in the WAL prohibited state), so we simply skip
+dirtying blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL
+prohibited state, so hint bits set in those states must not dirty the page if
+the buffer is not already dirty, when checksums are enabled.  Systems in
+Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
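
As a usage sketch for the function documented above, this is how a client
could drive the state change through libpq.  The connection string and error
handling are illustrative only; as described earlier, the function normally
waits until the requested state has taken effect:

#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	/* Connection parameters are placeholders. */
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* Ask the server to go WAL prohibited (read only). */
	res = PQexec(conn, "SELECT pg_prohibit_wal(true)");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "pg_prohibit_wal failed: %s", PQerrorMessage(conn));
	PQclear(res);

	/* ... later, put the system back to read write. */
	res = PQexec(conn, "SELECT pg_prohibit_wal(false)");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "pg_prohibit_wal failed: %s", PQerrorMessage(conn));
	PQclear(res);

	PQfinish(conn);
	return 0;
}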

v16-0001-Implement-wal-prohibit-state-using-global-barrie.patch
From 8fb5def73f98d2c37efeb8d2c38cc286351e6fc7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v16 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the requested state transition is
    marked as in progress in shared memory and the checkpointer process is
    signaled.  The checkpointer, noticing the state transition, emits the
    barrier request and then acknowledges back to the backend that requested
    the state change once the transition has been completed.  The final state
    is updated in the control file to make it persistent across system
    restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has been assigned an XID, then the backend
    is killed by throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special needs to be done right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up, e.g. a
    backend might later on request us to put the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    needs some discussion on this as well), but the end-of-recovery
    checkpoint, the necessary WAL writes, and the control file update needed
    to start the server normally will be skipped; they will be performed when
    the system is changed to WAL-Permitted mode.  Until then the "Database
    cluster state" will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is restricted on a standby server,
    except in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly pulls the server out of the
    read-only (WAL prohibited) state permanently.

 9. Add a system_is_read_only GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 404 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 ++-
 src/backend/access/transam/xlog.c        | 336 ++++++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 ++
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   4 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 894 insertions(+), 138 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..1de27529b69
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,404 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * encode the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply
+		 * abort an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  When booting a read-only (WAL prohibited)
+	 * server, the startup process skips the end-of-recovery checkpoint and
+	 * the related WAL writes, which must be completed before the system
+	 * state can be changed to read-write.  To keep any other backend from
+	 * writing a WAL record before the end-of-crash-recovery checkpoint
+	 * finishes, we leave the server in recovery mode.
+	 */
+	if (!StartupCrashRecoveryIsPending())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();		/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();		/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter.
+	 *
+	 * If the end-of-recovery checkpoint and the WAL writes required to start
+	 * the server normally were skipped previously, do them now.
+	 */
+	if (StartupCrashRecoveryIsPending())
+	{
+		/* Should only get here while changing the system to read-write. */
+		Assert(!walprohibit);
+		PerformPendingStartupOperations();
+	}
+	else if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();	/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state anyway.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process will perform the final state transition, so the shared
+	 * WAL prohibit state counter should not have changed by now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_READ_ONLY);
+
+	/* Update the control file to make state persistent */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	WALProhibitState cur_state;
+
+	/*
+	 * Must be called by the checkpointer process.  Checkpointer has to be sure
+	 * it has processed all pending wal prohibit state change requests as soon
+	 * as possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes
+	 * run in non-checkpointer processes, do nothing if not the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	while (cur_state != WALPROHIBIT_STATE_READ_WRITE)
+	{
+		if (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+			cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE)
+		{
+			CompleteWALProhibitChange();
+		}
+		else if (cur_state == WALPROHIBIT_STATE_READ_ONLY)
+		{
+			int			rc;
+
+			/*
+			 * Don't let Checkpointer process do anything until someone wakes it
+			 * up.  For example a backend might later on request us to put the
+			 * system back to read-write state.
+			 */
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_WALPROHIBIT_STATE);
+
+			/*
+			 * If the postmaster dies or a shutdown request is received, just
+			 * bail out.
+			 */
+			if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+				return;
+		}
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
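
An aside on the counter encoding used in this file: the low two bits of
wal_prohibit_counter carry the state, and each step of a transition bumps the
counter, so the states cycle READ_WRITE -> GOING_READ_ONLY -> READ_ONLY ->
GOING_READ_WRITE -> back to READ_WRITE.  A sketch of what the enum and
GetWALProhibitState() presumably look like is below; the real definitions live
in access/walprohibit.h, which is not quoted here, so treat the exact values
as assumptions consistent with the code above:

#include "postgres.h"

/* Assumed to mirror access/walprohibit.h; ordering matches the code above. */
typedef enum WALProhibitState
{
	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL writes permitted */
	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,	/* transition in progress */
	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL writes prohibited */
	WALPROHIBIT_STATE_GOING_READ_WRITE = 3	/* transition in progress */
} WALProhibitState;

/* The last two bits of the counter encode the current state. */
static inline WALProhibitState
GetWALProhibitState(uint32 wal_prohibit_counter)
{
	return (WALProhibitState) (wal_prohibit_counter & 3);
}
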
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df6b87..8f81974ac84 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb87324..fa293ebf600 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -730,6 +732,14 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * startupCrashRecoveryPending indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally
+	 * were skipped.  Lock protection is not needed since it isn't read or
+	 * updated concurrently.
+	 */
+	bool		startupCrashRecoveryPending;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -978,6 +988,13 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+							 bool localPromoteIsTriggered,
+							 XLogReaderState *xlogreader,
+							 bool archiveRecoveryRequested,
+							 TimeLineID endOfLogTLI, XLogRecPtr endOfLog,
+							 TimeLineID thisTimeLineID);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -6179,6 +6196,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of startupCrashRecoveryPending flag.
+ */
+bool
+StartupCrashRecoveryIsPending(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->startupCrashRecoveryPending;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6384,6 +6411,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6532,13 +6560,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only (wal
+		 * prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7772,16 +7809,130 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
 	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later in XLogAcceptWrites().
+	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which will determine whether further WAL inserts are
+	 * allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We always start in recovery here, since shutting down in the WAL
+		 * prohibited state skips the shutdown checkpoint, which forces
+		 * recovery on restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->startupCrashRecoveryPending = true;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+	{
+		promoted = XLogAcceptWrites(needChkpt, bgwriterLaunched,
+									LocalPromoteIsTriggered, xlogreader,
+									ArchiveRecoveryRequested,
+									EndOfLogTLI, EndOfLog, ThisTimeLineID);
+	}
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take longer
+	 * than is appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+/*
+ * This is the tail end of StartupXLOG(): it performs the WAL writes that are
+ * needed before the server can start running normally.  Only the startup
+ * process can call this function directly.
+ */
+static bool
+XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+				 bool localPromoteIsTriggered, XLogReaderState *xlogreader,
+				 bool archiveRecoveryRequested, TimeLineID endOfLogTLI,
+				 XLogRecPtr endOfLog, TimeLineID thisTimeLineID)
+{
+	bool		promoted = false;
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7799,15 +7950,17 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
-			if (LocalPromoteIsTriggered)
+			if (localPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7836,8 +7989,10 @@ StartupXLOG(void)
 			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
 	}
 
-	if (ArchiveRecoveryRequested)
+	if (archiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7855,7 +8010,7 @@ StartupXLOG(void)
 		 * pre-allocated files containing garbage. In any case, they are not
 		 * part of the new timeline's history so we don't need them.
 		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+		RemoveNonParentXlogFiles(endOfLog, thisTimeLineID);
 
 		/*
 		 * If the switch happened in the middle of a segment, what to do with
@@ -7886,14 +8041,14 @@ StartupXLOG(void)
 		 * restored from the archive to begin with, it's expected to have a
 		 * .done file).
 		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		if (XLogSegmentOffset(endOfLog, wal_segment_size) != 0 &&
 			XLogArchivingActive())
 		{
 			char		origfname[MAXFNAMELEN];
 			XLogSegNo	endLogSegNo;
 
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			XLByteToPrevSeg(endOfLog, endLogSegNo, wal_segment_size);
+			XLogFileName(origfname, endOfLogTLI, endLogSegNo, wal_segment_size);
 
 			if (!XLogArchiveIsReadyOrDone(origfname))
 			{
@@ -7901,7 +8056,7 @@ StartupXLOG(void)
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
 
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+				XLogFilePath(origpath, endOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
 				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
 
@@ -7917,63 +8072,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8007,20 +8112,51 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
+	return promoted;
+}
+
+/*
+ * This function should be called only if the startup process skipped the
+ * WAL-write operations of XLogAcceptWrites() because the system was read only
+ * (WAL prohibited), and it should be called only once, while changing the
+ * system back to read write.
+ */
+void
+PerformPendingStartupOperations(void)
+{
+	Assert(StartupCrashRecoveryIsPending());
 
 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * When the end-of-recovery checkpoint is skipped, InRecovery is always
+	 * true; for details, see where the startupCrashRecoveryPending flag is
+	 * set in StartupXLOG.  Since we are performing this operation here, the
+	 * necessary auxiliary processes are running, so bgwriterLaunched is true
+	 * as well.  The end-of-recovery checkpoint is never skipped when
+	 * ArchiveRecoveryRequested is true, because in that case the system
+	 * implicitly leaves the WAL prohibit state and allows all WAL writes
+	 * during startup.  Therefore ArchiveRecoveryRequested is false here, and
+	 * the values of the remaining parameters are inapplicable.
 	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	(void) XLogAcceptWrites(true,				/* needChkpt */
+							true,				/* bgwriterLaunched */
+							false,				/* localPromoteIsTriggered */
+							NULL,				/* xlogreader */
+							false,				/* archiveRecoveryRequested */
+							0,					/* endOfLogTLI */
+							InvalidXLogRecPtr,	/* endOfLog */
+							0);					/* thisTimeLineID */
+
+	XLogCtl->startupCrashRecoveryPending = false;
+}
+
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
 }
 
 /*
@@ -8237,9 +8373,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8258,9 +8394,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8282,6 +8429,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8571,9 +8724,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The shutdown checkpoint and xlog-file rotation are performed only if WAL
+	 * writing is permitted; during recovery, a restartpoint is created instead.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8586,6 +8743,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+			   (errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d78..da154254a4d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1525,6 +1525,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there is
+		 * a worker slot available.  Third, we need to make sure that no other
+		 * worker failed
+		 * while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb4..170de288c2c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719dd..63d52825497 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4307,6 +4307,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c43cdd685b4..31383a11d08 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -98,7 +99,6 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -538,8 +538,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -604,24 +604,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for too
+		 * long.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to delay WAL prohibit change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here as well,
+				 * for the same reason mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d626731723b..6547a4f0e1d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_system_is_read_only(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool system_is_read_only;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2048,6 +2050,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"system_is_read_only", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&system_is_read_only,
+		false,
+		NULL, NULL, show_system_is_read_only
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12228,4 +12242,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_system_is_read_only(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state
+ * to WAL prohibited.  Therefore, we need to explicitly error out before
+ * entering the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..7bff0adc2cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,7 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool StartupCrashRecoveryIsPending(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +327,8 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1487710d590..62b8ac41702 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11379,6 +11379,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87e..8f4fc4f1e15 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3adb3b..cd89ff06790 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2698,6 +2698,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitStateData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

#88Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#86)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Feb 19, 2021 at 5:43 PM Amul Sul <sulamul@gmail.com> wrote:

In the attached version I have made the changes accordingly what Robert has
summarised in his previous mail[1].

In addition to that, I also move the code that updates the control file to
XLogAcceptWrites() which will also get skipped when the system is read-only (wal
prohibited). The system will be in the crash recovery, and that will
change once we do the end-of-recovery checkpoint and the WAL writes operation
which we were skipping from startup. The benefit of keeping the system in
recovery mode is that it fixes my concern[2] where other backends could connect
and write wal records while we were changing the system to read-write. Now, no
other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery
checkpoint, and XLogReportParameters() operations will be performed in the same
sequence as it is in the startup while changing the system to read-write.

I was looking into the changes, especially the recovery-related problem, and
I have a few questions.

1.
+static bool
+XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+                 bool localPromoteIsTriggered, XLogReaderState *xlogreader,
+                 bool archiveRecoveryRequested, TimeLineID endOfLogTLI,
+                 XLogRecPtr endOfLog, TimeLineID thisTimeLineID)
+{
+    bool        promoted = false;
+
+    /*
.....
+            if (localPromoteIsTriggered)
             {
-                checkPointLoc = ControlFile->checkPoint;
+                XLogRecord *record;
...
+                record = ReadCheckpointRecord(xlogreader,
+                                              ControlFile->checkPoint,
+                                              1, false);
                 if (record != NULL)
                 {
                     promoted = true;
                    ...
                    CreateEndOfRecoveryRecord();
                }

Why do we need to move promote related code in XLogAcceptWrites?
IMHO, this promote related handling should be in StartupXLOG only.
That will look cleaner.

1] /messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
2] /messages/by-id/CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com

2.
I did not clearly understand your concern in point [2], because of which you
have to postpone RECOVERY_STATE_DONE until the system is set back to
read-write. Can you explain this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#89Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#88)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Mar 2, 2021 at 5:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Feb 19, 2021 at 5:43 PM Amul Sul <sulamul@gmail.com> wrote:

In the attached version I have made the changes accordingly what Robert has
summarised in his previous mail[1].

In addition to that, I also move the code that updates the control file to
XLogAcceptWrites() which will also get skipped when the system is read-only (wal
prohibited). The system will be in the crash recovery, and that will
change once we do the end-of-recovery checkpoint and the WAL writes operation
which we were skipping from startup. The benefit of keeping the system in
recovery mode is that it fixes my concern[2] where other backends could connect
and write wal records while we were changing the system to read-write. Now, no
other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery
checkpoint, and XLogReportParameters() operations will be performed in the same
sequence as it is in the startup while changing the system to read-write.

I was looking into the changes espcially recovery related problem, I
have a few questions

1.
+static bool
+XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+                 bool localPromoteIsTriggered, XLogReaderState *xlogreader,
+                 bool archiveRecoveryRequested, TimeLineID endOfLogTLI,
+                 XLogRecPtr endOfLog, TimeLineID thisTimeLineID)
+{
+    bool        promoted = false;
+
+    /*
.....
+            if (localPromoteIsTriggered)
{
-                checkPointLoc = ControlFile->checkPoint;
+                XLogRecord *record;
...
+                record = ReadCheckpointRecord(xlogreader,
+                                              ControlFile->checkPoint,
+                                              1, false);
if (record != NULL)
{
promoted = true;
...
CreateEndOfRecoveryRecord();
}

Why do we need to move promote related code in XLogAcceptWrites?
IMHO, this promote related handling should be in StartupXLOG only.

XLogAcceptWrites() tries to club together all the WAL write operations that
happen at the end of StartupXLOG(). The only exception is the promotion
checkpoint.

That will look cleaner.

I think it would be better to move the promotion checkpoint call inside
XLogAcceptWrites() as well, so that we can say XLogAcceptWrites() is the part
of StartupXLOG() that does the required WAL writes. Thoughts?

1] /messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
2] /messages/by-id/CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com

2.
I did not clearly understand your concern in point [2], because of which you
have to postpone RECOVERY_STATE_DONE until the system is set back to
read-write. Can you explain this?

Sure, let me explain how the transition to read-write occurs. When a backend
executes a function (i.e. pg_prohibit_wal(false)) to make the system
read-write, that state change is conveyed by the checkpointer process to all
existing backends using a global barrier, and while the checkpointer is still
in the process of conveying this barrier, any existing backend that has
already absorbed it could write new records.

We don't want that to happen in cases where the recovery-end checkpoint was
skipped in startup. We want the checkpointer to convey the barrier to all
backends first, but a backend shouldn't write WAL until the checkpointer has
written the recovery-end checkpoint record.

To keep these backends from writing WAL, I think we should keep the server in
crash recovery mode until UpdateFullPageWrites(), the end-of-recovery
checkpoint, and XLogReportParameters() have been performed.
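
To make the ordering concrete, here is a rough sketch of how the checkpointer
could drive the transition, reusing names from the patch; the wrapper function
and the exact call sites are illustrative only, not the committed code:

/* Hypothetical helper run by the checkpointer -- illustration only. */
static void
CompleteGoingReadWrite(void)
{
	/*
	 * 1. Convey the state change to every backend via the proc-signal
	 *    barrier and wait until all of them have absorbed it.  Because the
	 *    server is still logically in crash recovery, none of them can write
	 *    WAL yet, even after absorbing the barrier.
	 */
	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT));

	/*
	 * 2. Perform the WAL writes the startup process skipped: the
	 *    full-page-writes record, the end-of-recovery checkpoint, and
	 *    XLogReportParameters().
	 */
	if (StartupCrashRecoveryIsPending())
		PerformPendingStartupOperations();

	/* 3. Only now advertise READ WRITE and update pg_control. */
	SetControlFileWALProhibitFlag(false);
}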

Regards,
Amul

#90Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#89)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Mar 2, 2021 at 7:54 PM Amul Sul <sulamul@gmail.com> wrote:

XLogAcceptWrites() tries to club together all the WAL write operations that
happen at the end of StartupXLOG(). The only exception is the promotion
checkpoint.

Okay, I was expecting that XLogAcceptWrites would contain all the
WAL-write-related operations, which should be executed either at the end of
StartupXLOG() if the system is not read-only, or after the system is set back
to read-write. But the promotion-related code is completely irrelevant when
it is executed from PerformPendingStartupOperations. So I am not entirely
sure whether we want to keep that stuff in XLogAcceptWrites.

That will look cleaner.

I think it would be better to move the promotion checkpoint call inside
XLogAcceptWrites() as well, so that we can say XLogAcceptWrites() is the part
of StartupXLOG() that does the required WAL writes. Thoughts?

Okay, so if we want to keep all the WAL writes inside XLogAcceptWrites,
including the promotion-related stuff, then +1 for moving this inside
XLogAcceptWrites as well.

1] /messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
2] /messages/by-id/CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com

2.
I did not clearly understand your concern in point [2], because of which you
have to postpone RECOVERY_STATE_DONE until the system is set back to
read-write. Can you explain this?

Sure, let me explain how the transition to read-write occurs. When a backend
executes a function (i.e. pg_prohibit_wal(false)) to make the system
read-write, that state change is conveyed by the checkpointer process to all
existing backends using a global barrier, and while the checkpointer is still
in the process of conveying this barrier, any existing backend that has
already absorbed it could write new records.

We don't want that to happen in cases where the recovery-end checkpoint was
skipped in startup. We want the checkpointer to convey the barrier to all
backends first, but a backend shouldn't write WAL until the checkpointer has
written the recovery-end checkpoint record.

To keep these backends from writing WAL, I think we should keep the server in
crash recovery mode until UpdateFullPageWrites(), the end-of-recovery
checkpoint, and XLogReportParameters() have been performed.

Thanks for the explanation. Now I understand the problem; however, I am not
sure whether keeping the system in recovery is the best way to solve it. As
of now I don't have anything better to suggest, and I couldn't immediately
think of any problem with this solution, but I will think about it again.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#91Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#90)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Mar 2, 2021 at 9:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

We don't want that to happen in cases where the recovery-end checkpoint was
skipped in startup. We want the checkpointer to convey the barrier to all
backends first, but a backend shouldn't write WAL until the checkpointer has
written the recovery-end checkpoint record.

To keep these backends from writing WAL, I think we should keep the server in
crash recovery mode until UpdateFullPageWrites(), the end-of-recovery
checkpoint, and XLogReportParameters() have been performed.

I did not read the code for this, but let me ask something about this case.
Why do we want the checkpointer to convey the barrier to all the backends
before completing the end-of-recovery checkpoint and other stuff? Is it
because the system is still in the WAL prohibited state? Is it possible that,
as soon as we get the pg_prohibit_wal(false) request, the receiving backend
starts allowing WAL writes for itself, finishes all the post-recovery pending
work, and then asks the checkpointer to inform all the other backends?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#92Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#91)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Mar 2, 2021 at 9:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

We don't want that to happen in cases where the recovery-end checkpoint was
skipped in startup. We want the checkpointer to convey the barrier to all
backends first, but a backend shouldn't write WAL until the checkpointer has
written the recovery-end checkpoint record.

To keep these backends from writing WAL, I think we should keep the server in
crash recovery mode until UpdateFullPageWrites(), the end-of-recovery
checkpoint, and XLogReportParameters() have been performed.

I did not read the code for this, but let me ask something about this case.
Why do we want the checkpointer to convey the barrier to all the backends
before completing the end-of-recovery checkpoint and other stuff? Is it
because the system is still in the WAL prohibited state?

Consider the previous case, where the user wants to change the system to
read-write. When a permitted user executes pg_prohibit_wal(false), the WAL
prohibit state in shared memory is updated to GOING_READ_WRITE, which is the
transition state, and the backend then waits until the transition completes
and the final state (i.e. READ_WRITE) is set in shared memory. Setting the
final state is the job of the checkpointer process.

We have integrated code into the checkpointer process such that, if it sees a
WAL prohibit transition state, it completes the transition as soon as possible
by performing the necessary steps, i.e. emitting the super barrier and then
updating the final WAL prohibit state in shared memory and in the control
file.
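
As a side note for anyone following along, the state bookkeeping itself is the
counter described in walprohibit.h. Below is a tiny standalone model of just
that encoding (the low two bits of the counter give the state); it is only an
illustration, not the patch's shared-memory code:

#include <stdint.h>
#include <stdio.h>

/* Same encoding as the WALProhibitState enum in the patch. */
typedef enum
{
	WALPROHIBIT_STATE_READ_WRITE = 0,
	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
	WALPROHIBIT_STATE_READ_ONLY = 2,
	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
} WALProhibitState;

static WALProhibitState
GetWALProhibitState(uint32_t wal_prohibit_counter)
{
	/* Extract the last two bits. */
	return (WALProhibitState) (wal_prohibit_counter & 3);
}

int
main(void)
{
	static const char *const names[] = {
		"READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
	};
	uint32_t	counter = 0;	/* 0 or 2 at startup, per the control file */
	int			i;

	/* Every request and every completion advances the counter by one. */
	for (i = 0; i < 5; i++, counter++)
		printf("counter=%u -> %s\n", (unsigned) counter,
			   names[GetWALProhibitState(counter)]);
	return 0;
}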

Is it possible that,
as soon as we get the pg_prohibit_wal(false) request, the receiving backend
starts allowing WAL writes for itself, finishes all the post-recovery pending
work, and then asks the checkpointer to inform all the other backends?

Yes, it is possible for a backend to allow WAL writes temporarily for itself
by setting LocalXLogInsertAllowed, but when we ask the checkpointer for the
end-of-recovery checkpoint, the first thing it will do is the WAL prohibit
state transition, and only then the recovery-end checkpoint.

Also, allowing WAL writes in read-only (WAL prohibited) mode goes against the
principle of this feature.
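
(For reference, the temporary-allowance pattern the patch itself uses inside
XLogAcceptWrites() looks like this; a minimal excerpt with the surrounding
work elided:)

	/* Allow WAL insertion just for this process... */
	LocalSetXLogInsertAllowed();
	/* ...write the record(s) that are needed... */
	UpdateFullPageWrites();
	/* ...and go back to obeying the shared WAL prohibit state. */
	LocalXLogInsertAllowed = -1;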

Regards,
Amul

#93Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#92)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Mar 3, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, it is possible for a backend to allow WAL writes temporarily for itself
by setting LocalXLogInsertAllowed, but when we ask the checkpointer for the
end-of-recovery checkpoint, the first thing it will do is the WAL prohibit
state transition, and only then the recovery-end checkpoint.

Also, allowing WAL writes in read-only (WAL prohibited) mode goes against the
principle of this feature.

So IIUC, before the checkpointer changes the state in the control file, we
inform the other backends anyway, and only then are they allowed to write WAL,
right? If that is true, then what is the problem with first doing the pending
post-recovery work and then informing the backends about the state change? I
mean, we are in the process of changing the state to read-write, so why is it
necessary to inform all the backends before we can write WAL? Are we afraid
that if we write WAL and there is some failure before we make the system
read-write, it will break the principle of the feature, i.e. the system
eventually stays read-only but we wrote WAL anyway? If so, are we currently
ensuring that once we inform the backends and they are allowed to write WAL,
there is no chance of failure and the state change is guaranteed to complete?
If all of that is true, then I will take my point back.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#94Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#88)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Why do we need to move promote related code in XLogAcceptWrites?
IMHO, this promote related handling should be in StartupXLOG only.
That will look cleaner.

The key design question here, at least in my mind, is what exactly
happens after prohibit-WAL + system-crash + recovery-finishes. We
clearly can't write the checkpoint or end-of-recovery record and
proceed with business as usual, but are we still in recovery? Either
(1) we are technically still in recovery, stopping just short of
entering normal running, and will emerge from recovery when WAL is
permitted again; or (2) we have technically finished recovery, but
deferred some of the actions that would normally occur at that time
until a later point. Maybe this is an academic distinction as much as
anything, but the idea is if we choose #1 then we should do as little
as possible at the point when recovery finishes and defer as much as
possible until we actually enter normal running; whereas if we choose
#2 we should do as much as possible at the point when recovery
finishes and defer only those things which absolutely have to be
deferred. That said, I and I think also Andres are voting for #2.

But if we go that way, that precludes what you are proposing here. If
we picked #1 then it would be natural for the startup process to
remain active and the control file update to be postponed until WAL
writes are re-enabled; but under model #2 we want, if possible, for
the startup process to exit and the control file update to happen
normally, and only the writing of the actual WAL records to be
deferred.

What I find much odder, looking at the present patch, is that
PerformPendingStartupOperations() gets called from pg_prohibit_wal()
rather than by the checkpointer. If the checkpointer is the process
that is in charge of coordinating the change between a read-only state
and a read-write state, then it ought to also do this. I also think
that the PerformPendingStartupOperations() wrapper is unnecessary.
Just invert the sense of the XLogCtl flag: xlogAllowWritesDone, rather
than startupCrashRecoveryPending, and have XLogAcceptWrites() set it
(and return without doing anything if it's already set). Then the
checkpointer can just call the function unconditionally whenever we go
read-write, and for a bonus we will have much better naming
consistency, rather than calling the same thing "xlog accept writes"
in one place, "pending startup operations" in another, and "startup
crash recovery pending" in a third.

Since this feature is basically no longer "alter system read only" but
rather "pg_prohibit_wal" I think we also ought to rename the GUC,
system_is_read_only -> wal_prohibited.

--
Robert Haas
EDB: http://www.enterprisedb.com

#95Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#94)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Why do we need to move promote related code in XLogAcceptWrites?
IMHO, this promote related handling should be in StartupXLOG only.
That will look cleaner.

The key design question here, at least in my mind, is what exactly
happens after prohibit-WAL + system-crash + recovery-finishes. We
clearly can't write the checkpoint or end-of-recovery record and
proceed with business as usual, but are we still in recovery? Either
(1) we are technically still in recovery, stopping just short of
entering normal running, and will emerge from recovery when WAL is
permitted again; or (2) we have technically finished recovery, but
deferred some of the actions that would normally occur at that time
until a later point. Maybe this is academic distinction as much as
anything, but the idea is if we choose #1 then we should do as little
as possible at the point when recovery finishes and defer as much as
possible until we actually enter normal running; whereas if we choose
#2 we should do as much as possible at the point when recovery
finishes and defer only those things which absolutely have to be
deferred. That said, I and I think also Andres are voting for #2.

But if we go that way, that precludes what you are proposing here. If
we picked #1 then it would be natural for the startup process to
remain active and the control file update to be postponed until WAL
writes are re-enabled; but under model #2 we want, if possible, for
the startup process to exit and the control file update to happen
normally, and only the writing of the actual WAL records to be
deferred.

The current patch does a mix of both: the startup process exits without doing
the WAL writes and control file updates, and those happen later when the
system changes to read-write.

What I find much odder, looking at the present patch, is that
PerformPendingStartupOperations() gets called from pg_prohibit_wal()
rather than by the checkpointer. If the checkpointer is the process
that is in charge of coordinating the change between a read-only state
and a read-write state, then it ought to also do this. I also think
that the PerformPendingStartupOperations() wrapper is unnecessary.
Just invert the sense of the XLogCtl flag: xlogAllowWritesDone, rather
than startupCrashRecoveryPending, and have XLogAcceptWrites() set it
(and return without doing anything if it's already set). Then the
checkpointer can just call the function unconditionally whenever we go
read-write, and for a bonus we will have much better naming
consistency, rather than calling the same thing "xlog accept writes"
in one place, "pending startup operations" in another, and "startup
crash recovery pending" in a third.

Ok, in the attached version I have used the xlogAllowWritesDone variable. To
match its name, it is set to 'false' initially and gets set to 'true' when the
XLogAcceptWrites() operation completes.

I have removed the PerformPendingStartupOperations() wrapper function and
slightly changed XLogAcceptWrites() to minimize its parameter count, so that
it can use the available global variables instead of parameters.
Unfortunately, it cannot be called from the checkpointer unconditionally: that
would create a race with the startup process. If the startup process is still
in recovery when the checkpointer launches and sees xlogAllowWritesDone =
false, the checkpointer would go ahead with those WAL write operations and the
end-of-recovery checkpoint, which would be a disaster. Therefore, I moved the
XLogAcceptWrites() call inside ProcessWALProhibitStateChangeRequest(), to be
made when the system is in the GOING_READ_WRITE transition state. Since
ProcessWALProhibitStateChangeRequest() gets called from different places in
the checkpointer process, which creates cascaded calls to XLogAcceptWrites(),
I am setting xlogAllowWritesDone = true immediately after it gets checked in
XLogAcceptWrites() to avoid that. I think this is not the right approach;
technically, it should be updated at the end of XLogAcceptWrites().

I think that instead of xlogAllowWritesDone we should use its inverse, as
before, e.g. xlogAllowWritesPending or xlogAllowWritesSkipped or something
else, which gets explicitly set to 'true' when we skip the XLogAcceptWrites()
call. That avoids the race between the checkpointer and the startup process,
since the flag is initially 'false', and if it is 'false' we return
immediately from XLogAcceptWrites(). Also, we then don't need to move
XLogAcceptWrites() inside ProcessWALProhibitStateChangeRequest(); it can be
called from the CheckpointerMain() loop, which also avoids the cascaded calls,
and we don't need to update the flag until those write operations are
complete. Thoughts/Comments?
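
A rough sketch of that alternative, with the flag name and its placement in
XLogCtl still being assumptions under discussion:

/*
 * xlogAllowWritesPending is an assumed XLogCtl flag: the startup process sets
 * it to true only at the point where it skips the WAL-write work because the
 * system is WAL prohibited.  It starts out false, so a checkpointer that runs
 * while recovery is still in progress (or when nothing was skipped) returns
 * immediately; the call can therefore live in the CheckpointerMain() loop.
 */
static void
XLogAcceptWrites(void)
{
	if (!XLogCtl->xlogAllowWritesPending)
		return;

	/*
	 * ... UpdateFullPageWrites(), end-of-recovery checkpoint,
	 * XLogReportParameters(), control file update ...
	 */

	/* Clear the flag only after all of the above has completed. */
	XLogCtl->xlogAllowWritesPending = false;
}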

Since this feature is basically no longer "alter system read only" but
rather "pg_prohibit_wal" I think we also ought to rename the GUC,
system_is_read_only -> wal_prohibited.

Done.

Regards,
Amul

Attachments:

v17-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From aa10e528840bec423787e1edc821fe4874f43a3c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v17 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, add an Assert or an ERROR when the system is
WAL prohibited:

 - Add an ERROR in functions that can be reached without a valid XID, e.g. in
   case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, add the common
   static inline function CheckWALPermitted().
 - Add an Assert in functions that cannot be reached without a valid XID; the
   Assert also verifies XID validity.  For that, add AssertWALPermitted_HaveXID().

To enforce the rule that the aforesaid assert or error check is made before
entering a critical section that writes WAL, a new assert-only flag
walpermit_checked_state is added.  If the check is missing, XLogBeginInsert()
asserts when it is called inside a critical section.

If the WAL insert is not done inside a critical section, the above check is
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 26 ++++++++++++-----
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 459 insertions(+), 69 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index builds will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d0..661e88da372 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2104,6 +2105,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2427,6 +2430,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2987,6 +2992,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3738,6 +3745,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3911,6 +3920,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4843,6 +4854,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5633,6 +5646,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5791,6 +5806,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5899,6 +5916,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -6019,6 +6038,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6049,6 +6069,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6059,7 +6083,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e66..5158e3ee24e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1485,6 +1490,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1502,7 +1510,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never has a WAL prohibit state, so skip
+	 * the permission check when we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 1edb9f95797..6ef57e8384c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c09e492a5f3..769052ca83b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -195,6 +196,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	BTMetaPageData *metad;
 	bool		rewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -232,6 +234,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -244,7 +250,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -363,6 +369,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -408,6 +415,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -426,7 +437,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1145,6 +1156,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1275,6 +1289,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2059,6 +2075,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2147,6 +2164,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2198,7 +2219,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2285,6 +2306,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2516,6 +2538,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2592,7 +2618,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6cc..ac0c6af49e0 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2195,6 +2198,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2285,6 +2291,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign a transaction ID while the system is read only */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 857c0cf997c..0dd1372e3e6 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8f81974ac84..f9cc3dcca11 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 53d25045cd1..e5650a0710c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1034,7 +1034,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2865,9 +2865,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not block WAL flushing; otherwise, a dirty buffer
+	 * could not be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8969,6 +8971,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9230,6 +9234,9 @@ CreateCheckPoint(int flags)
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		LocalXLogInsertAllowed = 1;
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9387,6 +9394,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10039,7 +10048,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10053,10 +10062,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
+	 * initialization done by XLogInsertAllowed() doesn't trigger an
 	 * assertion failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10078,8 +10087,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* WAL permission has already been checked at this point */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in the
+		 * WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 170de288c2c..bb4cb53dcb6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery, but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL writes are not allowed at all, don't
+			 * dirty the page.  We can set the hint, but must not dirty the
+			 * page as a result, lest we trigger WAL generation.  Unless the
+			 * page is dirtied again later, the hint will be lost when the
+			 * page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

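To make the new coding rule concrete, here is a minimal sketch (not part of the
patch) of a no-XID WAL write, mirroring the FreeSpaceMapPrepareTruncateRel()
hunk above. The function name is hypothetical; CheckWALPermitted() and the
surrounding calls are the ones the patch adds or already relies on, and the
buffer is assumed to be pinned and exclusively locked by the caller:

    #include "postgres.h"

    #include "access/walprohibit.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Hypothetical helper that WAL-logs a full page image of "buf". */
    static void
    log_example_page(Relation rel, Buffer buf)
    {
        bool    needwal = RelationNeedsWAL(rel);

        /*
         * Check WAL permission while raising an ERROR is still harmless;
         * inside the critical section the same error would escalate to PANIC.
         */
        if (needwal)
            CheckWALPermitted();

        START_CRIT_SECTION();

        MarkBufferDirty(buf);

        /* The assertion in XLogBeginInsert() passes thanks to the check above. */
        if (needwal)
            log_newpage_buffer(buf, false);

        END_CRIT_SECTION();
    }
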
Attachment: v17-0003-WIP-Documentation.patch (application/x-patch)
From 35b9bfb67c19c4294f31031615ed95cd9bdf0a43 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v17 3/3] WIP - Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 33 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index fee05619612..0eb27ecdd19 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24717,9 +24717,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24805,6 +24805,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept the new state immediately. When
+        <literal>true</literal> is passed, the system is changed to read-only
+        (the WAL prohibited state), if it is not in that state already. When
+        <literal>false</literal> is passed, the system is changed to
+        read-write (the WAL permitted state), if it is not in that state
+        already. See <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..4e3d39fe94d 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,37 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which write-ahead log insertion is prohibited until
+    the same function is executed again to change the state back to
+    read-write. As in Hot Standby, connections to the server may still run
+    read-only queries while in the WAL prohibited state.  While the system
+    only permits read-only queries, the GUC <literal>wal_prohibited</literal>
+    reports <literal>on</literal>; otherwise it reports <literal>off</literal>.
+    When the WAL prohibited state is requested, any session whose current
+    transaction has already performed, or may still perform, WAL writes is
+    terminated.  This is useful in an HA setup where the master server needs
+    to stop accepting WAL writes immediately and kick out any transaction that
+    would expect to write WAL at the end, for example when the master loses
+    network connectivity or replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; on
+    restart, the server goes through crash recovery and remains in that state
+    until the system is changed back to read-write. If a read-only server
+    finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system
+    implicitly leaves the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because it is still in recovery or because it
+was forced into the WAL prohibited state by executing ALTER SYSTEM READ ONLY.
+We have a lower-level defense in XLogBeginInsert() and elsewhere that stops us
+from modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is
+called inside a critical section we must not depend on it to report an error,
+because any error there escalates to a PANIC, as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed by the user at any time to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+must stop writing WAL immediately.  While absorbing the barrier, the backend
+kills its running transaction if that transaction has a valid XID, because a
+valid XID indicates that the transaction has performed, or is planning to
+perform, WAL writes.  Transactions that have not yet acquired an XID, and
+operations such as VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily
+have a valid XID when writing WAL, are not interrupted by barrier processing;
+they may instead hit the error from XLogBeginInsert() when they try to write
+WAL in the read-only state.  To keep that error from being raised inside a
+critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, an assert-only flag records whether permission has
+been checked before XLogBeginInsert() is called; if it has not been,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the flag, call CheckWALPermitted(),
+AssertWALPermitted_HaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION().  The flag resets automatically when the critical section
+is exited.  The rules for choosing among the permission check routines are:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. VACUUM) must use CheckWALPermitted(), so that the error
+	can be reported before the critical section is entered.
+
+	Places where INSERTs and UPDATEs are expected, which never happen without
+	a valid XID, can use AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay for the check.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but that still need the permission check recorded
+	in assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so in those states we simply
+skip dirtying blocks because of hints.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so when checksums are enabled, hint bits set in those states must not
+dirty the page if the buffer is not already dirty.  Systems in Hot-Standby mode
+may benefit from hint bits being set, but with checksums enabled, a page cannot
+be dirtied after setting a hint bit (due to the torn page risk). So, it must
+wait for full-page images containing the hint bit updates to arrive from the
+primary.
-- 
2.18.0

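As a companion to the rules spelled out in the transam/README addition above,
here is a short sketch (again, not part of the patch) of the XID-holding
variant, mirroring the sequence.c hunks in 0002. The function is hypothetical;
only the check macro and the ordering are the point:

    #include "postgres.h"

    #include "access/walprohibit.h"
    #include "access/xact.h"
    #include "miscadmin.h"
    #include "utils/rel.h"

    /* Hypothetical update-style path that always runs with a valid XID. */
    static void
    example_logged_update(Relation rel)
    {
        if (RelationNeedsWAL(rel))
        {
            /* In the WAL prohibited state, acquiring an XID fails (and
             * sessions that already hold one are killed by the barrier)... */
            GetTopTransactionId();

            /* ...so an assert-only check, free in non-assert builds, suffices. */
            AssertWALPermittedHaveXID();
        }

        START_CRIT_SECTION();

        /* apply the buffer changes and insert the WAL record here */

        END_CRIT_SECTION();
    }
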
Attachment: v17-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 2a9f98c32ac6b96ae7be89ff2722b339f7c04241 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v17 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the current state is marked as
    in-progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the state transition, emits the barrier
    request and then acknowledges back to the backend that requested the
    state change once the transition has completed.  The final state is
    updated in the control file to make it persistent across system
    restarts.

 2. When a backend receives the WAL-Prohibited barrier while it is already in
    a transaction that has been assigned an XID, the backend is killed by
    throwing FATAL (XXX: need more discussion on this).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special is needed right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up, e.g. a
    backend might later request us to put the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped. Starting up again will perform crash recovery (XXX:
    need some discussion on this as well), but the end-of-recovery
    checkpoint, the necessary WAL writes, and the control file update needed
    to start the server normally are skipped; they are performed when the
    system is changed to WAL-Permitted mode. Until then, "Database cluster
    state" will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is restricted on a standby server,
    except for the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly pulls the server out of the
    read-only (wal prohibited) state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 425 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +-
 src/backend/access/transam/xlog.c        | 304 +++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 +
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   6 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 892 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..857c0cf997c
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,425 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which still need some thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply kill
+		 * an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  When booting a read-only (WAL prohibited)
+	 * server, the startup process skips the end-of-recovery checkpoint and
+	 * the related WAL writes, which must be completed before the system state
+	 * can be changed to read-write.  To disallow any other backend from
+	 * writing a WAL record before the end-of-crash-recovery checkpoint
+	 * finishes, we leave the server in recovery mode.
+	 */
+	if (XLogWriteAllowedIsDone())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("The checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	if (cur_state == WALPROHIBIT_STATE_READ_WRITE ||
+		cur_state == WALPROHIBIT_STATE_READ_ONLY)
+		return;
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed. If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared
+	 * WAL prohibit state counter should not have been changed by anyone else
+	 * by now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	WALProhibitState cur_state;
+
+	/*
+	 * Must be called by the checkpointer process, which has to process all
+	 * pending WAL prohibit state change requests as soon as possible.  Since
+	 * CreateCheckPoint and ProcessSyncRequests sometimes run in
+	 * non-checkpointer processes, do nothing if we are not the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process normally performs to
+				 * bring the server up were skipped; if so, do them right
+				 * away.
+				 * away.
+				 */
+				ResetLocalXLogInsertAllowed();
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
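The walprohibit.h hunk is outside the quoted context, so for readers of
walprohibit.c above, here is my reading of the state encoding it relies on; the
type and value names are the ones used in the .c file, but the numeric values
and the macro body are assumptions, not the actual header:

    /*
     * Assumed encoding: the low two bits of wal_prohibit_counter are the
     * current state, and every pg_atomic_add_fetch_u32(..., 1) advances the
     * state machine by one step.
     */
    typedef enum WALProhibitState
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,       /* WAL writes permitted */
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,  /* transition in progress */
        WALPROHIBIT_STATE_READ_ONLY = 2,        /* WAL writes prohibited */
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3  /* transition in progress */
    } WALProhibitState;

    #define GetWALProhibitState(counter) \
        ((WALProhibitState) ((counter) & 3))

Under that reading, pg_prohibit_wal() bumps the counter once to enter a GOING_*
state, and the checkpointer bumps it a second time in CompleteWALProhibitChange()
to land on the final state, which is what the target_counter_value arithmetic
above waits for.
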
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df6b87..8f81974ac84 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb87324..53d25045cd1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -730,6 +732,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * xlogAllowWritesDone indicates whether the end-of-recovery checkpoint
+	 * and the WAL writes required to start the server normally have already
+	 * been performed; they are skipped when starting in WAL prohibited state.
+	 */
+	bool		xlogAllowWritesDone;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6179,6 +6187,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of xlogAllowWritesDone flag.
+ */
+bool
+XLogWriteAllowedIsDone(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->xlogAllowWritesDone;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6382,8 +6400,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6532,13 +6550,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7772,16 +7799,136 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will
+	 * be written later in XLogAcceptWrites.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in recovery, since shutting down in the WAL
+		 * prohibit state skips the shutdown checkpoint, which forces crash
+		 * recovery on restart.
+		 */
+		Assert(needChkpt);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary
+ * before the server can be started normally.  These operations are skipped at
+ * startup when the system is started in the WAL prohibited state; in that case
+ * the checkpointer performs them when the system is changed to the WAL
+ * permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only startup, checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (XLogWriteAllowedIsDone())
+		return;
+
+	/* Lock protection is not needed since we are the only one updating this. */
+	XLogCtl->xlogAllowWritesDone = true;
+
+	/*
+	 * If the system is in the WAL prohibited state, these operations cannot
+	 * be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7799,15 +7946,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7838,6 +7990,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7917,63 +8071,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8008,21 +8112,25 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8237,9 +8345,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8258,9 +8366,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to "unconditionally
+	 * true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8282,6 +8401,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8571,9 +8696,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is created during recovery; otherwise, the shutdown
+	 * checkpoint and xlog rotation are performed only if WAL writing is
+	 * permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8586,6 +8715,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
@@ -9088,6 +9220,16 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/*
+	 * WAL insertion is already enabled for the end-of-recovery checkpoint, but
+	 * when this checkpoint is performed for the WAL prohibit state transition,
+	 * the local WAL insert flag gets reset before we reach here, while we
+	 * process the barrier for ourselves.  Set the flag again to avoid an error
+	 * from the end-of-recovery checkpoint's WAL insertion.
+	 */
+	if (flags & CHECKPOINT_END_OF_RECOVERY)
+		LocalXLogInsertAllowed = 1;
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54a..d4291121540 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1527,6 +1527,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb4..170de288c2c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719dd..63d52825497 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4307,6 +4307,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay WAL prohibit state change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * Don't want to delay WAL prohibit state change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4bcf705a30d..3dbff0e8785 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2049,6 +2051,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12229,4 +12243,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..b0033f5a599 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,7 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool XLogWriteAllowedIsDone(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +327,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 59d2b71ca9c..618a4029c28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11382,6 +11382,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87e..8f4fc4f1e15 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95aefa1d..e8c0fb54547 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2700,6 +2700,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

#96Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#93)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Mar 3, 2021 at 7:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 3, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, a process can temporarily allow WAL writes for itself by setting
LocalXLogInsertAllowed, but when we ask the checkpointer for the end-of-recovery
checkpoint, the first thing it will do is the WAL prohibit state transition, and
only then the end-of-recovery checkpoint.

Also, allowing WAL writes in read-only (WAL prohibited) mode goes against the
principle of this feature.

So IIUC, before the checkpointer changes the state in the control file, we
anyway inform the other backends, and only then are they allowed to write
WAL, is that right? If that is true, then what is the problem with first
doing the pending post-recovery processing and then informing the backends
about the state change? I mean, we are in the process of changing the state
to read-write, so why is it necessary to inform all the backends before we
can write WAL? Are we afraid that, after we write the WAL, some failure
before we make the system read-write would break the principle of the
feature, i.e. the system would eventually stay read-only but we already
wrote WAL? If so, then currently, are we assuring that once we inform the
backends and they are allowed to write WAL, there is no chance of failure
and the system state is guaranteed to change? If all of that is true then I
will take my point back.

The WAL prohibit state transition handling code is integrated into various
places in the checkpointer process so that it can pick up state changes as
soon as possible. Before informing the other backends we could do
UpdateFullPageWrites(), but when we next attempt the end-of-recovery
checkpoint write operation, the checkpointer will hit
ProcessWALProhibitStateChangeRequest() first, which will try to complete the
WAL prohibit state transition and only then write the checkpoint record.
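
To make that ordering concrete, here is a tiny standalone C sketch.  It is
only a toy model, not code from the patch; the lowercase names are made up
and merely stand in for ProcessWALProhibitStateChangeRequest(),
XLogInsertAllowed() and the checkpoint code in the real tree:

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins: wal_permitted models XLogInsertAllowed(), and the two
 * functions model ProcessWALProhibitStateChangeRequest() and the
 * end-of-recovery checkpoint respectively. */
static bool wal_permitted = true;
static bool prohibit_requested = true;	/* a pending pg_prohibit_wal(true) */
static bool checkpoint_requested = true;

static void
process_wal_prohibit_state_change(void)
{
	if (prohibit_requested)
	{
		wal_permitted = false;		/* complete the transition first */
		prohibit_requested = false;
		printf("state transition completed: WAL prohibited\n");
	}
}

static void
maybe_create_checkpoint(void)
{
	if (checkpoint_requested && wal_permitted)
		printf("checkpoint record written\n");
	else if (checkpoint_requested)
		printf("checkpoint deferred: WAL writes are prohibited\n");
}

int
main(void)
{
	/* The checkpointer always services the state change before any work
	 * that would write WAL, so a pending read-only request wins. */
	process_wal_prohibit_state_change();
	maybe_create_checkpoint();
	return 0;
}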

Regards,
Amul

#97Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#94)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Why do we need to move the promotion-related code into XLogAcceptWrites?
IMHO, this promotion-related handling should be in StartupXLOG only.
That will look cleaner.

The key design question here, at least in my mind, is what exactly
happens after prohibit-WAL + system-crash + recovery-finishes. We
clearly can't write the checkpoint or end-of-recovery record and
proceed with business as usual, but are we still in recovery? Either
(1) we are technically still in recovery, stopping just short of
entering normal running, and will emerge from recovery when WAL is
permitted again; or (2) we have technically finished recovery, but
deferred some of the actions that would normally occur at that time
until a later point. Maybe this is academic distinction as much as
anything, but the idea is if we choose #1 then we should do as little
as possible at the point when recovery finishes and defer as much as
possible until we actually enter normal running; whereas if we choose
#2 we should do as much as possible at the point when recovery
finishes and defer only those things which absolutely have to be
deferred. That said, I and I think also Andres are voting for #2.

But if we go that way, that precludes what you are proposing here. If
we picked #1 then it would be natural for the startup process to
remain active and the control file update to be postponed until WAL
writes are re-enabled; but under model #2 we want, if possible, for
the startup process to exit and the control file update to happen
normally, and only the writing of the actual WAL records to be
deferred.

Maybe I did not put my point clearly, so let me clarify. First, I was
also inclined toward making it work like #2. And if it works like #2,
then I would expect the code that goes into the XLogAcceptWrites
function to be minimal: only the parts that we want to execute after the
system is back in read-write mode. So basically, XLogAcceptWrites should
contain only the common code that we want to execute at the end of
StartupXLOG when the system is normal, or when the system comes back to
read-write if it was read-only. My point was that all the uncommon code
we have moved into XLogAcceptWrites should stay inside the StartupXLOG
function. So I think the promotion-related code doesn't belong in the
XLogAcceptWrites function.
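
Roughly, the split I have in mind looks like the toy sketch below.  It is a
standalone illustration only, not code from the patch; the lowercase names
and flags are made up for the example:

#include <stdbool.h>
#include <stdio.h>

/* Made-up flags standing in for the real promotion and WAL prohibit checks. */
static bool promoted = true;
static bool wal_prohibited = false;

static void
xlog_accept_writes(void)
{
	/* Only the common tail: FPW record, end-of-recovery checkpoint and
	 * control file update, the same whether it runs at the end of
	 * StartupXLOG or later from the checkpointer. */
	printf("common post-recovery WAL writes\n");
}

static void
startup_xlog(void)
{
	/* ... recovery itself ... */

	if (promoted)
		printf("promotion-specific handling stays in StartupXLOG\n");

	if (!wal_prohibited)
		xlog_accept_writes();	/* otherwise deferred to the checkpointer */
}

int
main(void)
{
	startup_xlog();
	return 0;
}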

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#98Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#95)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Why do we need to move the promotion-related code into XLogAcceptWrites?
IMHO, this promotion-related handling should be in StartupXLOG only.
That will look cleaner.

The key design question here, at least in my mind, is what exactly
happens after prohibit-WAL + system-crash + recovery-finishes. We
clearly can't write the checkpoint or end-of-recovery record and
proceed with business as usual, but are we still in recovery? Either
(1) we are technically still in recovery, stopping just short of
entering normal running, and will emerge from recovery when WAL is
permitted again; or (2) we have technically finished recovery, but
deferred some of the actions that would normally occur at that time
until a later point. Maybe this is academic distinction as much as
anything, but the idea is if we choose #1 then we should do as little
as possible at the point when recovery finishes and defer as much as
possible until we actually enter normal running; whereas if we choose
#2 we should do as much as possible at the point when recovery
finishes and defer only those things which absolutely have to be
deferred. That said, I and I think also Andres are voting for #2.

But if we go that way, that precludes what you are proposing here. If
we picked #1 then it would be natural for the startup process to
remain active and the control file update to be postponed until WAL
writes are re-enabled; but under model #2 we want, if possible, for
the startup process to exit and the control file update to happen
normally, and only the writing of the actual WAL records to be
deferred.

The current patch does a mix of both: the startup process exits without doing
the WAL writes and control file updates, and those happen later when the
system changes to read-write.

What I find much odder, looking at the present patch, is that
PerformPendingStartupOperations() gets called from pg_prohibit_wal()
rather than by the checkpointer. If the checkpointer is the process
that is in charge of coordinating the change between a read-only state
and a read-write state, then it ought to also do this. I also think
that the PerformPendingStartupOperations() wrapper is unnecessary.
Just invert the sense of the XLogCtl flag: xlogAllowWritesDone, rather
than startupCrashRecoveryPending, and have XLogAcceptWrites() set it
(and return without doing anything if it's already set). Then the
checkpointer can just call the function unconditionally whenever we go
read-write, and for a bonus we will have much better naming
consistency, rather than calling the same thing "xlog accept writes"
in one place, "pending startup operations" in another, and "startup
crash recovery pending" in a third.

Ok, in the attached version, I have used the xlogAllowWritesDone variable.
To match the naming sense, it should be set to 'false' initially and
should get set to 'true' when the XLogAcceptWrites() operation completes.

I have removed the PerformPendingStartupOperations() wrapper function and
slightly changed XLogAcceptWrites() to minimize its parameter count so that it
can use the available global variable values instead of parameters.
Unfortunately, it cannot be called from the checkpointer unconditionally: that
would create a race with the startup process, where the startup process is
still in recovery while the checkpointer launches, sees that
xlogAllowWritesDone = false, and goes ahead with those WAL write operations and
the end-of-recovery checkpoint, which would be a disaster. Therefore, I moved
the XLogAcceptWrites() call inside ProcessWALProhibitStateChangeRequest(), to
be made when the system is in the GOING_READ_WRITE transition state. Since
ProcessWALProhibitStateChangeRequest() gets called from different places in the
checkpointer process, which creates a cascaded call to XLogAcceptWrites(), I am
updating xlogAllowWritesDone = true immediately after it gets checked in
XLogAcceptWrites(), which I think is not the right approach; technically, it
should be updated at the end of XLogAcceptWrites().

I think that instead of xlogAllowWritesDone we should use its inverse, as
before, e.g. xlogAllowWritesPending or xlogAllowWritesSkipped or something
else, which would be explicitly set to 'true' when we skip the
XLogAcceptWrites() call. That would avoid the race between the checkpointer
and the startup process, since initially it would be 'false', and if it is
'false' we would return immediately from XLogAcceptWrites(). Also, we would
not need to move XLogAcceptWrites() inside
ProcessWALProhibitStateChangeRequest(); it could be called from the
checkpointerMain() loop, which also avoids the cascaded calls, and we would
not need to update the flag until we complete those write operations.
Thoughts/Comments?

In the attached version, I was able to fix most of the concerns that I had.
Keeping the xlogAllowWritesDone variable is now fine, and it gets updated at
the end of the XLogAcceptWrites() function, unlike in the previous version.
XLogAcceptWrites() is called from ProcessWALProhibitStateChangeRequest()
when the system state changes to read-write, as before. Now, to avoid the
recursive call to ProcessWALProhibitStateChangeRequest() from the
end-of-recovery checkpoint performed inside XLogAcceptWrites(), I have added
a private boolean state variable in walprohibit.c; with it, the WAL prohibit
state transition can be put on hold for a while, and that is what I do
around the call to XLogAcceptWrites().
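
As a toy illustration of those two guards, here is a small standalone model.
It is not the patch code; the lowercase names merely mirror
xlogAllowWritesDone and the new private hold flag, and everything else is
made up:

#include <stdbool.h>
#include <stdio.h>

static bool xlog_allow_writes_done = false;	/* mirrors xlogAllowWritesDone */
static bool hold_state_transition = false;	/* mirrors the private hold flag */

static void process_state_change(void);

static void
xlog_accept_writes(void)
{
	if (xlog_allow_writes_done)
		return;					/* already performed once */

	printf("UpdateFullPageWrites, end-of-recovery checkpoint, etc.\n");

	/* The end-of-recovery checkpoint re-enters the checkpointer's state
	 * machine; the hold flag makes that re-entry a no-op. */
	process_state_change();

	xlog_allow_writes_done = true;	/* set only at the very end */
}

static void
process_state_change(void)
{
	if (hold_state_transition)
		return;					/* avoid recursing into the transition */

	hold_state_transition = true;
	xlog_accept_writes();
	hold_state_transition = false;

	printf("complete GOING_READ_WRITE -> READ_WRITE transition\n");
}

int
main(void)
{
	process_state_change();		/* as the checkpointer would do it */
	return 0;
}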

Regards,
Amul

Attachments:

v18-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 70f638966e2e26239eec09ae57d8801f67465ee2 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v18 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the in-progress state transition is
    marked in shared memory and the checkpointer process is signaled.  The
    checkpointer, noticing the in-progress state transition, emits the barrier
    request and, once the transition has been completed, acknowledges back to
    the backend that requested the state change.  The final state is updated
    in the control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then the
    backend will be killed by throwing FATAL (XXX: need more discussion
    on this).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up.  E.g. a
    backend might later request us to put the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    need some discussion on this as well), but the end-of-recovery
    checkpoint, the necessary WAL writes and the control file update needed
    to start the server normally will be skipped; they will be performed when
    the system is changed to WAL-Permitted mode.  Until then "Database cluster
    state" will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is restricted on a standby server, except
    in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile will implicitly pull the server out of
    the read-only (WAL prohibited) state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 438 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +-
 src/backend/access/transam/xlog.c        | 292 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 +
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   6 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 893 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..824a19d62e0
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should only be here while transitioning towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need some thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery except in the
+	 * crash recovery case.  In the startup process, we skip the end-of-recovery
+	 * checkpoint and related WAL write operations while booting a read only
+	 * (WAL prohibited) server; these must be completed before changing the
+	 * system state to read write.  To disallow any other backend from writing
+	 * a WAL record before the end-of-crash-recovery checkpoint finishes, we
+	 * leave the server in recovery mode.
+	 */
+	if (XLogWriteAllowedIsDone())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state since we have yet to convey this WAL prohibit
+	 * state to all backends.  The checkpointer will do that and update the
+	 * shared memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once wal prohibit state transition set then, that needs to be
+	 * completed. If the server crashes before the state completion, then the
+	 * control file information will be used to set final the final wal
+	 * prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared
+	 * WAL prohibit state counter shouldn't have changed between then and
+	 * now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * make sure it processes all pending WAL prohibit state change requests
+	 * as soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if we are not
+	 * the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process normally performs to
+				 * start the server have been skipped; if so, do them right
+				 * away.  While doing that, hold off the state transition to
+				 * avoid a recursive call to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint performed
+				 * here.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c83aa16f2ce..df872d42ffa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb87324..49286a16c31 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -730,6 +732,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * xlogAllowWritesDone indicates whether the end-of-recovery checkpoint and
+	 * the WAL writes required to start the server normally have been done.
+	 */
+	bool		xlogAllowWritesDone;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6179,6 +6187,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of xlogAllowWritesDone flag.
+ */
+bool
+XLogWriteAllowedIsDone(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->xlogAllowWritesDone;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6382,8 +6400,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6532,13 +6550,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7772,16 +7799,133 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will
+	 * be written later in XLogAcceptWrites.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize WAL prohibit state in shared
+	 * memory, which will decide whether further WAL inserts should be allowed
+	 * or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We always start in recovery, since at shutdown in the WAL prohibit
+		 * state we skip the shutdown checkpoint, which forces crash recovery.
+		 */
+		Assert(needChkpt);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary
+ * before starting the server normally.  These operations are skipped by the
+ * startup process if the system is started in the WAL prohibited state, and
+ * are performed by the checkpointer when the system changes to WAL permitted.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only the startup process, the checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (XLogWriteAllowedIsDone())
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state then these operations
+	 * cannot be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7799,15 +7943,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7838,6 +7987,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7917,63 +8068,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8002,27 +8103,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->xlogAllowWritesDone = true;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8237,9 +8343,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8258,9 +8364,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8282,6 +8399,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8571,9 +8694,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint if in recovery; otherwise, perform the shutdown
+	 * checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8586,6 +8713,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fb1116d09ad..5d3038a460b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1526,6 +1526,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there is
+		 * a worker slot available.  Third, we need to make sure that no other
+		 * worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb4..170de288c2c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for a WAL prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the checkpointer.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9259dc9d3e1..4c17ffab008 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4306,6 +4306,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for a WAL prohibit state change request for the checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to delay WAL prohibit state change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe26..8c3067b2773 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2049,6 +2051,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12227,4 +12241,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string must match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Asserts that this code path is never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertions above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while the system state is being changed to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
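
The counter encoding described in the header comment above can be made concrete
with a small standalone sketch.  This is only an illustration, not part of the
patch; it simply mirrors the WALProhibitState enum and the GetWALProhibitState()
logic shown in this file:

    #include <stdio.h>
    #include <stdint.h>

    /* Mirrors walprohibit.h: the state is the low two bits of the counter. */
    typedef enum
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
        WALPROHIBIT_STATE_READ_ONLY = 2,
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3
    } WALProhibitState;

    static const char *const state_names[] = {
        "READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
    };

    int
    main(void)
    {
        /* Start read-write (0); a cluster starting read-only would begin at 2. */
        uint32_t    counter = 0;

        /* Each completed step of a transition advances the counter by one. */
        for (int i = 0; i < 6; i++)
        {
            WALProhibitState state = (WALProhibitState) (counter & 3);

            printf("counter=%u -> %s\n", (unsigned) counter, state_names[state]);
            counter++;
        }
        return 0;
    }
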
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..b0033f5a599 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,7 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool XLogWriteAllowedIsDone(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +327,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited, i.e. the system is read only. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2ccc3e7c7c7..f386449571c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11395,6 +11395,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e0c70d221be..8ac620d1188 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95aefa1d..e8c0fb54547 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2700,6 +2700,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

Attachment: v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 5c4f958249de981b46391a0040aaf1e7c767190a Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v18 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before WAL writes, based on the following criteria,
for the case where the system is WAL prohibited:

 - Added an ERROR in functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Added an Assert in functions that cannot be reached without a valid XID;
   the Assert also checks XID validity.  For that, AssertWALPermittedHaveXID()
   is added.

To enforce the rule that one of the aforesaid assert or error checks precedes a
critical section that writes WAL, a new assert-only flag walpermit_checked_state
is added.  If the check is missing, XLogBeginInsert() fails an assertion when it
is called inside a critical section.

If the WAL insert is not done inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() to perform the check and report an
error.
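
To illustrate, a call site covered by this patch ends up with roughly the shape
sketched below.  This is only a schematic fragment assuming the usual backend
environment ("rel" is a made-up variable here); the concrete changes are in the
diff that follows:

    bool        needwal = RelationNeedsWAL(rel);

    /*
     * This path can be reached without a valid XID (e.g. from VACUUM), so
     * raise an ERROR while it is still safe, before the critical section
     * starts.  Paths that always have an XID use AssertWALPermittedHaveXID()
     * instead.
     */
    if (needwal)
        CheckWALPermitted();

    START_CRIT_SECTION();

    /* ... modify shared buffers ... */

    if (needwal)
    {
        /* With asserts enabled, this verifies that one of the checks above ran. */
        XLogBeginInsert();
        /* ... register buffers/data and emit the WAL record ... */
    }

    END_CRIT_SECTION();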
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 460 insertions(+), 70 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d0..661e88da372 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2104,6 +2105,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2427,6 +2430,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2987,6 +2992,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3738,6 +3745,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3911,6 +3920,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4843,6 +4854,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5633,6 +5646,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5791,6 +5806,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5899,6 +5916,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -6019,6 +6038,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6049,6 +6069,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6059,7 +6083,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL while in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e66..5158e3ee24e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1485,6 +1490,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1502,7 +1510,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 1edb9f95797..6ef57e8384c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c09e492a5f3..769052ca83b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -195,6 +196,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	BTMetaPageData *metad;
 	bool		rewrite = false;
 	XLogRecPtr	recptr;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -232,6 +234,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -244,7 +250,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 
@@ -363,6 +369,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -408,6 +415,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -426,7 +437,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1145,6 +1156,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1275,6 +1289,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2059,6 +2075,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2147,6 +2164,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2198,7 +2219,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2285,6 +2306,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2516,6 +2538,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2592,7 +2618,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16fb..31404dfdb70 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2195,6 +2198,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2293,6 +2299,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 824a19d62e0..5dae4667975 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag used to enforce the rule that WAL insert permission is
+ * checked before starting a critical section that writes WAL.  For this, one
+ * of CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index df872d42ffa..ef5f7ef0a2c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 49286a16c31..309e4b25f81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1034,7 +1034,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2865,9 +2865,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not restrict WAL flushing, since a dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8967,6 +8969,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9218,6 +9222,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9375,6 +9382,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10027,7 +10036,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10041,10 +10050,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10066,8 +10075,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 170de288c2c..bb4cb53dcb6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL writes are currently disallowed, don't
+			 * dirty the page.  We can set the hint, but must not dirty the
+			 * page as a result, lest we trigger WAL generation.  Unless the
+			 * page is dirtied again later, the hint will be lost when the page
+			 * is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section.
+ * Otherwise, mark it as checked-and-used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
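
To make the intent of the hunks above easier to review, the caller-side
pattern that 0002 repeats throughout the access methods can be condensed into
a minimal sketch.  This is not code from the patch: example_modify_page, rel,
and buf are illustrative placeholders, and the WAL record is reduced to a
full-page image via log_newpage_buffer() for brevity.

#include "postgres.h"

#include "access/walprohibit.h"		/* CheckWALPermitted(), added by this series */
#include "access/xloginsert.h"		/* log_newpage_buffer() */
#include "miscadmin.h"				/* START_CRIT_SECTION() */
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Illustrative only: check WAL permission outside the critical section, so a
 * WAL-prohibited condition is reported as a plain ERROR instead of escalating
 * to a PANIC once the critical section has started.
 */
static void
example_modify_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Reachable without an XID (think VACUUM), so use the ERROR variant. */
	if (needwal)
		CheckWALPermitted();

	/* NO EREPORT(ERROR) from here till changes are logged */
	START_CRIT_SECTION();

	/* ... scribble on the page here ... */
	MarkBufferDirty(buf);

	if (needwal)
		log_newpage_buffer(buf, true);	/* stand-in for the real WAL record */

	END_CRIT_SECTION();
}

Paths that can only be reached with an assigned XID (heap insert/update/delete,
btree splits, and so on) use AssertWALPermittedHaveXID() instead, so non-assert
builds pay no extra cost there.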

v18-0003-WIP-Documentation.patchapplication/x-patch; name=v18-0003-WIP-Documentation.patchDownload
From fa91ccc9ca9bf6c22c21b5e362129787206078cf Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v18 3/3] WIP - Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 33 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ece09699ef8..f2246379b73 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24744,9 +24744,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24832,6 +24832,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state according to its boolean argument and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to read-only
+        (the WAL prohibited state), if it is not already. When
+        <literal>false</literal> is passed, the system is changed to
+        read-write (the WAL permitted state), if it is not already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..4e3d39fe94d 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,37 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system
+    into a read-only mode in which inserting write-ahead log is prohibited
+    until the same function is executed again to change the state back to
+    read-write. As in Hot Standby, connections to the server are allowed to
+    run read-only queries while in the WAL prohibited state.  While only
+    read-only queries are allowed, the GUC <literal>wal_prohibited</literal>
+    reports <literal>on</literal>; otherwise it reports <literal>off</literal>.
+    When a user requests the WAL prohibited state, any session whose open
+    transaction has already performed, or may still perform, WAL writes is
+    terminated.  This is useful in an HA setup where the master server needs
+    to stop accepting WAL writes immediately and kick out any transaction
+    expecting to write WAL at the end, for example when the master loses its
+    network or its replication connections fail.
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; at
+    restart, the server goes through crash recovery and remains in that state
+    until the system is changed to read-write. If, while starting a read-only
+    server, a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file is found, the system implicitly
+    leaves the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read-only system state
+----------------------
+
+The read-only system state is the state in which it is not currently possible
+to insert write-ahead log records, either because the system is still in
+recovery or because it was forced into the WAL prohibited state by executing
+ALTER SYSTEM READ ONLY.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere to stop us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because, as mentioned previously, any error there causes a
+PANIC.
+
+During recovery we never reach the point of trying to write WAL, but ALTER
+SYSTEM READ ONLY can be executed by the user at any time to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+needs to stop writing WAL immediately.  To absorb the barrier, a backend kills
+its running transaction if that transaction has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning to perform, WAL
+writes.  Transactions that have not yet acquired an XID, and operations such
+as VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have an XID
+when writing WAL, are not stopped by barrier processing; they might instead
+hit the error from XLogBeginInsert() when trying to write WAL in the read-only
+state.  To keep that error from being raised inside a critical section, WAL
+write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that records
+whether permission was checked before calling XLogBeginInsert().  If it was
+not, XLogBeginInsert() fails an assertion.  The permission check is not
+mandatory when XLogBeginInsert() runs outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, one of
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+on exiting the critical section.  The rules for placing these permission check
+routines are:
+
+	Places where a WAL write inside a critical section can be reached without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places that perform INSERT and UPDATE, which never happen without a valid
+	XID, can be checked with AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but that still need the permission check satisfied
+	in assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints while in that state.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
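
Tying the README additions above back to the mechanics in 0002: the
enforcement is just an assert-only flag plus one Assert in XLogBeginInsert().
The condensed (and simplified) view below is assembled from the miscadmin.h,
walprohibit.c, and xloginsert.c hunks posted earlier; it is not a standalone
excerpt of either patch.

#ifdef USE_ASSERT_CHECKING
/* Set by the permission-check routines, reset when a critical section ends. */
typedef enum
{
	WALPERMIT_UNCHECKED,
	WALPERMIT_CHECKED,
	WALPERMIT_CHECKED_AND_USED
} WALPermitCheckState;

extern WALPermitCheckState walpermit_checked_state;
#endif

void
XLogBeginInsert(void)
{
	/*
	 * Inside a critical section the caller must already have called
	 * CheckWALPermitted(), AssertWALPermittedHaveXID(), or
	 * AssertWALPermitted(); an ERROR raised from here would be promoted to
	 * PANIC.
	 */
	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);

	/* Outside a critical section, reporting an ERROR is acceptable. */
	CheckWALPermitted();

	/* ... the rest of XLogBeginInsert() is unchanged ... */
}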

#99Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Amul Sul (#98)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Mar 9, 2021 at 3:31 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Why do we need to move promote related code in XLogAcceptWrites?
IMHO, this promote related handling should be in StartupXLOG only.
That will look cleaner.

The key design question here, at least in my mind, is what exactly
happens after prohibit-WAL + system-crash + recovery-finishes. We
clearly can't write the checkpoint or end-of-recovery record and
proceed with business as usual, but are we still in recovery? Either
(1) we are technically still in recovery, stopping just short of
entering normal running, and will emerge from recovery when WAL is
permitted again; or (2) we have technically finished recovery, but
deferred some of the actions that would normally occur at that time
until a later point. Maybe this is academic distinction as much as
anything, but the idea is if we choose #1 then we should do as little
as possible at the point when recovery finishes and defer as much as
possible until we actually enter normal running; whereas if we choose
#2 we should do as much as possible at the point when recovery
finishes and defer only those things which absolutely have to be
deferred. That said, I and I think also Andres are voting for #2.

But if we go that way, that precludes what you are proposing here. If
we picked #1 then it would be natural for the startup process to
remain active and the control file update to be postponed until WAL
writes are re-enabled; but under model #2 we want, if possible, for
the startup process to exit and the control file update to happen
normally, and only the writing of the actual WAL records to be
deferred.

Current patch doing a mix of both, startup process exits without doing
WAL writes and control file updates, that happens later when system
changes to read-write.

What I find much odder, looking at the present patch, is that
PerformPendingStartupOperations() gets called from pg_prohibit_wal()
rather than by the checkpointer. If the checkpointer is the process
that is in charge of coordinating the change between a read-only state
and a read-write state, then it ought to also do this. I also think
that the PerformPendingStartupOperations() wrapper is unnecessary.
Just invert the sense of the XLogCtl flag: xlogAllowWritesDone, rather
than startupCrashRecoveryPending, and have XLogAcceptWrites() set it
(and return without doing anything if it's already set). Then the
checkpointer can just call the function unconditionally whenever we go
read-write, and for a bonus we will have much better naming
consistency, rather than calling the same thing "xlog accept writes"
in one place, "pending startup operations" in another, and "startup
crash recovery pending" in a third.

Ok, in the attached version, I have used the xlogAllowWritesDone variable.

To match the naming sense, it should be set to 'false' initially and
should get set to 'true' when the XLogAcceptWrites() operation completes.

I have removed the PerformPendingStartupOperations() wrapper function and I
have slightly changed XLogAcceptWrites() to minimize its parameter count so
that it can use available global variable values instead of parameters.
Unfortunately, it cannot be called from checkpointer unconditionally, it will
create a race with startup process when startup process still in recovery and
checkpointer launches and see that xlogAllowWritesDone = false, will go-ahead
for those wal write operations and end-of-recovery checkpoint which will be a
disaster. Therefore, I moved this XLogAcceptWrites() function inside
ProcessWALProhibitStateChangeRequest() and called when the system is in
GOING_READ_WRITE transition state. Since ProcessWALProhibitStateChangeRequest()
gets called from a different places of checkpointer process which creates a
cascaded call to XLogAcceptWrites() function, to avoid that I am updating
xlogAllowWritesDone = true immediately after it gets checked in
XLogAcceptWrites() which I think is not the right approach, technically, it
should be updated at the end of XLogAcceptWrites().

I think instead of xlogAllowWritesDone, we should use invert of it, as
the previous, e.g. xlogAllowWritesPending or xlogAllowWritesSkipped or
something else and that will be get explicitly set 'true' when we skip
XLogAcceptWrites() call. That will avoid the race of checkpointer process with
the startup since initially, it will be 'false', and if it is 'false' we will
return immediately from XLogAcceptWrites(). Also, we don't need to move
XLogAcceptWrites() inside ProcessWALProhibitStateChangeRequest(), it can be
called from checkpointerMain() loop which also avoids cascade calls and we
don't need to update it until we complete those write operations.
Thoughts/Comments?

In the attached version, I am able to fix most of the concerns that I had.
Right now, having the xlogAllowWritesDone variable is fine, and that will get
updated at the end of the XLogAcceptWrites() function, unlike the previous.
XLogAcceptWrites() will be called from ProcessWALProhibitStateChangeRequest()
while the system state changes to read-write, like previous. Now to avoid the
recursive call to ProcessWALProhibitStateChangeRequest() from the
end-of-recovery checkpoint happening in XLogAcceptWrites(), I have added a
private boolean state variable in walprohibit.c, using it wal prohibit state
transition can be put on hold for some time; did the same while calling
XLogAcceptWrites().

Regards,
Amul

One of the patches
(v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the
latest patchset does not apply successfully.

http://cfbot.cputube.org/patch_32_2602.log

=== applying patch ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch

Hunk #15 succeeded at 2604 (offset -13 lines).
1 out of 15 hunks FAILED -- saving rejects to file src/backend/access/nbtree/nbtpage.c.rej
patching file src/backend/access/spgist/spgdoinsert.c

It is a very minor change, so I rebased the patch. Please take a look, if
that works for you.

--
Ibrar Ahmed

Attachments:

v19-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchapplication/octet-stream; name=v19-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchDownload
From 5c4f958249de981b46391a0040aaf1e7c767190a Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v18 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, an Assert or an ERROR is added for the case
where the system is WAL prohibited:

 - Added ERROR for functions that can be reached without a valid XID, e.g. in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, the common static inline
   function CheckWALPermitted() is added.
 - Added Assert for functions that cannot be reached without a valid XID; the
   Assert also verifies the XID.  For that, AssertWALPermitted_HaveXID() is added.

To enforce the rule that one of the aforesaid assert or error checks runs
before entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added. If the check is missing, XLogBeginInsert()
fails an assertion when called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above
checking is not necessary; we can rely on XLogBeginInsert() to perform the
check and report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 460 insertions(+), 70 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d0..661e88da372 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2104,6 +2105,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2427,6 +2430,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2987,6 +2992,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3738,6 +3745,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3911,6 +3920,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4843,6 +4854,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5633,6 +5646,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5791,6 +5806,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5899,6 +5916,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -6019,6 +6038,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6049,6 +6069,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6059,7 +6083,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e66..5158e3ee24e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/storage.h"
@@ -758,6 +759,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1200,6 +1202,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1215,7 +1220,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1485,6 +1490,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1502,7 +1510,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1931,6 +1939,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1938,6 +1947,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1963,7 +1975,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 1edb9f95797..6ef57e8384c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1901,13 +1904,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2469,6 +2475,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index fc744cf9fd..9a09850640 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;

diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16fb..31404dfdb70 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2195,6 +2198,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2293,6 +2299,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 824a19d62e0..5dae4667975 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index df872d42ffa..ef5f7ef0a2c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can reach here only with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 49286a16c31..309e4b25f81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1034,7 +1034,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2865,9 +2865,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state should not restrict WAL flushing; otherwise, dirty
+	 * buffers cannot be evicted until WAL has been flushed up to their LSNs.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -8967,6 +8969,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while the system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9218,6 +9222,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9375,6 +9382,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10027,7 +10036,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10041,10 +10050,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10066,8 +10075,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Ensure that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 170de288c2c..bb4cb53dcb6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,6 +924,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* Checkpoints are allowed in recovery but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..909c3e75107 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3820,13 +3820,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool		needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e3082..2ee57769835 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

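The hunks above add CheckWALPermitted()/AssertWALPermitted() calls ahead of
WAL-writing critical sections.  To make that coding rule concrete, here is a
condensed, illustrative sketch (not part of the patch) of the pattern a
WAL-writing code path is expected to follow; the buffer variable, rmgr id, and
record info flag (buf, RM_FOO_ID, XLOG_FOO_OP) are hypothetical placeholders:

    XLogRecPtr  recptr;

    /*
     * New rule: verify that WAL writes are permitted *before* entering the
     * critical section; an ERROR raised inside a critical section would be
     * promoted to PANIC.
     */
    CheckWALPermitted();

    START_CRIT_SECTION();

    MarkBufferDirty(buf);           /* buf: an exclusively-locked buffer */

    XLogBeginInsert();
    XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
    recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);
    PageSetLSN(BufferGetPage(buf), recptr);

    END_CRIT_SECTION();

For code paths that have already acquired an XID, AssertWALPermittedHaveXID()
is used instead, since such sessions are terminated before the WAL prohibit
barrier is absorbed.
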
Attachment: v19-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/octet-stream)
From 70f638966e2e26239eec09ae57d8801f67465ee2 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v18 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user requests a change to the WAL-Prohibited state by calling the
    pg_prohibit_wal(true) SQL function, the current state is marked as
    in-progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the pending state transition, emits the
    barrier request and then acknowledges back to the backend that requested
    the state change once the transition has been completed.  The final
    state is recorded in the control file to make it persistent across
    system restarts.

 2. When a backend absorbs the WAL-Prohibited barrier while it is already in
    a transaction that has been assigned an XID, the backend is killed by
    throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special needs to be done right away; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until woken up, e.g. by a backend that
    later requests putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint and
    xlog rotation.  Starting up again will perform crash recovery (XXX: needs
    some discussion as well), but the end-of-recovery checkpoint, the
    necessary WAL writes, and the control file update needed to start the
    server normally are skipped; they are performed when the system is
    changed to WAL-Permitted mode.  Until then the "Database cluster state"
    will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is not allowed on a standby server,
    except in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly and permanently pulls the
    server out of the read-only (WAL prohibited) state.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
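
For illustration, a minimal usage sketch of the SQL interface added by this
patch; pg_prohibit_wal() waits until the checkpointer has completed the state
transition, and wal_prohibited is the read-only GUC added below:

    -- prohibit WAL writes; returns once all backends have absorbed the barrier
    SELECT pg_prohibit_wal(true);

    -- shows "on" while the system is read only (or in recovery)
    SHOW wal_prohibited;

    -- permit WAL writes again
    SELECT pg_prohibit_wal(false);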
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 438 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +-
 src/backend/access/transam/xlog.c        | 292 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 +
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   6 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 893 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..824a19d62e0
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should only get here while transitioning towards the WAL
+		 * prohibit state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: We kill off the whole session by throwing FATAL instead of
+		 * killing only the transaction by throwing ERROR, for the following
+		 * reasons that still need more thought:
+		 *
+		 * 1. The wire protocol presents some challenges: we cannot simply
+		 * abort an idle transaction without losing protocol synchronization.
+		 *
+		 * 2. If we are inside a subtransaction, the ERROR would abort only the
+		 * current subtransaction.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  When booting a read only (WAL prohibited)
+	 * server, the startup process skips the end-of-recovery checkpoint and
+	 * the related WAL writes, which must be completed before the system can
+	 * be changed to read write.  To disallow any other backend from writing
+	 * a WAL record before that end-of-crash-recovery checkpoint finishes, we
+	 * leave the server in recovery mode.
+	 */
+	if (XLogWriteAllowedIsDone())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the new WAL
+	 * prohibit state to all backends.  The checkpointer will do that and will
+	 * update the shared memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer, or else from a standalone
+	 * (single-user) backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it must be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We now complete the
+	 * transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state anyway.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared
+	 * WAL prohibit state counter should not have been changed in the
+	 * meantime.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * This work belongs to the checkpointer process, which has to make sure
+	 * that all pending WAL prohibit state change requests are processed as
+	 * soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if we are not
+	 * the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server is started in wal prohibited state then the
+				 * required wal write operation in the startup process to
+				 * start the server normally has been skipped, if it is, then
+				 * does that right away.  While doing that, hold off state
+				 * transition to avoid a recursive call to process wal
+				 * prohibit state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c83aa16f2ce..df872d42ffa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb87324..49286a16c31 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -730,6 +732,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * xlogAllowWritesDone indicates whether the end-of-recovery checkpoint
+	 * and the WAL writes required to start the server normally have been done.
+	 */
+	bool		xlogAllowWritesDone;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6179,6 +6187,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of xlogAllowWritesDone flag.
+ */
+bool
+XLogWriteAllowedIsDone(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->xlogAllowWritesDone;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6382,8 +6400,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6532,13 +6550,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7772,16 +7799,133 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will
+	 * be written later in XLogAcceptWrites.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts should be
+	 * allowed or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We always start in recovery here, because shutting down in the WAL
+		 * prohibit state skips the shutdown checkpoint, forcing crash recovery.
+		 */
+		Assert(needChkpt);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary
+ * before the server can run normally.  These operations are skipped at startup
+ * if the system is started in the WAL prohibited state, and are instead done
+ * by the checkpointer when the system is changed to the WAL permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only startup, checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (XLogWriteAllowedIsDone())
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, then these operations
+	 * cannot be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7799,15 +7943,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7838,6 +7987,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7917,63 +8068,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8002,27 +8103,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->xlogAllowWritesDone = true;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8237,9 +8343,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8258,9 +8364,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8282,6 +8399,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8571,9 +8694,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * In recovery, perform a restartpoint; otherwise, perform the shutdown
+	 * checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8586,6 +8713,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fb1116d09ad..5d3038a460b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1526,6 +1526,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb4..170de288c2c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -688,6 +690,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1335,3 +1340,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9259dc9d3e1..4c17ffab008 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4306,6 +4306,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b507..c3c5ec641cf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,6 +223,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up WAL prohibit change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * For the same reason mentioned previously for the wal prohibit
+				 * state change request check.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe26..8c3067b2773 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2049,6 +2051,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12227,4 +12241,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertions, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
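
To illustrate the coding rule these helpers enforce, here is a hypothetical
call-site sketch; it is not part of the patch, and the function names, rmgr ID,
and record info constants (RM_EXAMPLE_ID, XLOG_EXAMPLE_*) are invented for
illustration. A path that may write WAL without an assigned XID checks
permission before its critical section, while an INSERT/UPDATE-style path that
always has an XID can use the assertion-only variant:

    /* Hypothetical WAL-writing path that may run without an XID (VACUUM-like). */
    static void
    example_log_cleanup(Buffer buf, char *payload, int len)
    {
        /* Report ERROR here, before the critical section, if WAL is prohibited. */
        CheckWALPermitted();

        START_CRIT_SECTION();
        /* ... apply the change to the shared buffer ... */
        MarkBufferDirty(buf);
        XLogBeginInsert();      /* would fail an assertion without the check above */
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        XLogRegisterData(payload, len);
        PageSetLSN(BufferGetPage(buf),
                   XLogInsert(RM_EXAMPLE_ID, XLOG_EXAMPLE_CLEANUP));
        END_CRIT_SECTION();
    }

    /* Hypothetical INSERT/UPDATE-style path, which always has an XID. */
    static void
    example_log_update(Buffer buf, char *payload, int len)
    {
        /* Free in non-assert builds; XID-bearing sessions were already killed. */
        AssertWALPermittedHaveXID();

        START_CRIT_SECTION();
        MarkBufferDirty(buf);
        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        XLogRegisterData(payload, len);
        PageSetLSN(BufferGetPage(buf),
                   XLogInsert(RM_EXAMPLE_ID, XLOG_EXAMPLE_UPDATE));
        END_CRIT_SECTION();
    }

In either case the permission flag set by these helpers is reset automatically
when the critical section is exited.
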
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd0..b0033f5a599 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -314,6 +315,7 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool XLogWriteAllowedIsDone(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -325,6 +327,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2ccc3e7c7c7..f386449571c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11395,6 +11395,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e0c70d221be..8ac620d1188 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1067,6 +1067,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95aefa1d..e8c0fb54547 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2700,6 +2700,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v19-0003-WIP-Documentation.patch (application/octet-stream)
From fa91ccc9ca9bf6c22c21b5e362129787206078cf Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v18 3/3] WIP - Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 33 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ece09699ef8..f2246379b73 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24744,9 +24744,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24832,6 +24832,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to adopt that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to read-only
+        (the WAL prohibited state), if it is not already. When
+        <literal>false</literal> is passed, the system is changed to read-write
+        (the WAL permitted state), if it is not already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
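
As a usage illustration (not part of the patch), a minimal libpq client sketch
that toggles the state with pg_prohibit_wal(), reads the wal_prohibited GUC,
and then attempts a write might look like the following. The connection string
and the table name "t" are placeholders, and since EXECUTE on
pg_prohibit_wal() is revoked from PUBLIC, the connecting role needs to have
been granted it (or be a superuser):

    #include <stdio.h>
    #include "libpq-fe.h"

    int
    main(void)
    {
        PGconn   *conn = PQconnectdb("dbname=postgres");    /* placeholder conninfo */
        PGresult *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        /* Force the system into the WAL prohibited (read-only) state. */
        res = PQexec(conn, "SELECT pg_prohibit_wal(true)");
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "pg_prohibit_wal failed: %s", PQerrorMessage(conn));
        PQclear(res);

        /* The internal GUC reflects the new state. */
        res = PQexec(conn, "SHOW wal_prohibited");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("wal_prohibited = %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);

        /* A write should now be rejected; "t" is a placeholder table name. */
        res = PQexec(conn, "INSERT INTO t DEFAULT VALUES");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "write rejected: %s", PQerrorMessage(conn));
        PQclear(res);

        PQfinish(conn);
        return 0;
    }

Passing false to pg_prohibit_wal() puts the system back into the WAL permitted
state in the same way.
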
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..4e3d39fe94d 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,37 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call the
+    <function>pg_prohibit_wal</function> function to force the system into a
+    read-only mode in which write-ahead log insertion is prohibited until the
+    same function is executed again to change the state back to read-write.
+    As in Hot Standby, connections to the server can still run read-only
+    queries in the WAL prohibited state.  While the system only allows
+    read-only queries, the GUC <literal>wal_prohibited</literal> reports
+    <literal>on</literal>; otherwise, it reports <literal>off</literal>.  When
+    the WAL prohibited state is requested, any existing session that is running
+    a transaction which has already performed, or may still perform, WAL write
+    operations is terminated.  This is useful for HA setups where the master
+    server needs to stop accepting WAL writes immediately and kick out any
+    transaction expecting to write WAL, for example when the network is down
+    on the master or replication connections have failed.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; on restart,
+    the server goes into crash recovery and stays in that state until the
+    system is changed back to read-write. If a read-only server finds a
+    <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system implicitly
+    leaves the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into the WAL prohibited state by executing ALTER SYSTEM
+READ ONLY.  We have a lower-level defense in XLogBeginInsert() and elsewhere to
+stop us from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because an error there would cause a PANIC, as noted above.
+
+We never reach the point of trying to write WAL during recovery, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+must stop writing WAL immediately.  While absorbing the barrier, a backend
+kills its running transaction if it has a valid XID, since a valid XID
+indicates that the transaction has performed, or plans to perform, WAL writes.
+Transactions that have not yet acquired a valid XID, and operations such as
+VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have a valid XID
+when writing WAL, are not stopped by barrier processing; they might instead
+hit an error from XLogBeginInsert() while trying to write WAL in the read-only
+state.  To prevent such an error inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assertion flag that indicates
+whether permission was checked before calling XLogBeginInsert().  If it was
+not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exit from
+the critical section.  The rules for placing the permission-check routines
+are as follows:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error can be reported before the critical section is entered.
+
+	Places where INSERT or UPDATE is expected, which never happens without a
+	valid XID, can be checked with AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where the permission check should still be
+	verified on assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set in those states must not dirty the page if the buffer
+is not already dirty, when checksums are enabled.  Systems in Hot-Standby mode
+may benefit from hint bits being set, but with checksums enabled, a page cannot
+be dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

#100Amul Sul
sulamul@gmail.com
In reply to: Ibrar Ahmed (#99)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

On Tue, Mar 9, 2021 at 3:31 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[....]

One of the patches (v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the latest patchset does not apply successfully.

http://cfbot.cputube.org/patch_32_2602.log

=== applying patch ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch

Hunk #15 succeeded at 2604 (offset -13 lines).
1 out of 15 hunks FAILED -- saving rejects to file src/backend/access/nbtree/nbtpage.c.rej
patching file src/backend/access/spgist/spgdoinsert.c

It is a very minor change, so I rebased the patch. Please take a look and see if that works for you.

Thanks. I am getting one more failure, for vacuumlazy.c, on the
latest master head (d75288fb27b); I fixed that in the attached version.

Regards,
Amul

Attachments:

v20-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 456078f67920753620b7908014d0da57c04c34b0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v20 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the state transition is marked as
    in progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the pending state transition, emits the barrier
    request and then acknowledges back to the backend that requested the state
    change once the transition has been completed.  The final state is updated
    in the control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then the
    backend is killed by throwing FATAL (XXX: need more discussion on this).

 3. Otherwise, if the backend is running a transaction without a valid XID, we
    don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up, e.g. a
    backend might later request putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint and
    xlog rotation.  Starting up again will perform crash recovery (XXX: need
    some discussion on this as well), but the end-of-recovery checkpoint, the
    necessary WAL writes, and the control file update needed to start the
    server normally will be skipped; they are performed when the system is
    changed to WAL-Permitted mode.  Until then "Database cluster state" will
    be "in crash recovery".

 7. Altering the WAL-Prohibited mode is restricted on a standby server, except
    in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly pulls the server out of the
    read-only (wal prohibited) state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 438 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  37 +-
 src/backend/access/transam/xlog.c        | 292 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  19 +
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   6 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 893 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..824a19d62e0
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibited
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of killing
+		 * the transaction by throwing ERROR, for the following reasons that
+		 * need more thought:
+		 *
+		 * 1. Due to challenges presented by the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  The startup process skips the end-of-recovery
+	 * checkpoint and the related WAL writes when booting a read-only (WAL
+	 * prohibited) server; these must be completed before changing the system
+	 * state to read-write.  To disallow any other backend from writing a WAL
+	 * record before that end-of-crash-recovery checkpoint finishes, we leave
+	 * the server in recovery mode.
+	 */
+	if (XLogWriteAllowedIsDone())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared-memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed. If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared WAL
+	 * prohibit state counter should not have changed since we read it
+	 * above.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  Checkpointer has to be
+	 * sure it has processed all pending wal prohibit state change requests as
+	 * soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if not
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes the startup process would normally perform to
+				 * start the server were skipped; if so, do them right away.
+				 * While doing that, hold off state transitions to avoid a
+				 * recursive attempt to process the WAL prohibit state
+				 * transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let Checkpointer process do anything until
+					 * someone wakes it up.  For example a backend might later
+					 * on request us to put the system back to read-write
+					 * state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
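
To make the counter arithmetic used above concrete, here is a small standalone
sketch (not part of the patch) that simulates the two-increment handshake
documented in walprohibit.h: the backend's increment moves the state to one of
the GOING_* values, the checkpointer's increment completes the transition, and
the low two bits of the counter always give the current state:

    #include <stdio.h>

    /* Mirrors WALProhibitState: counter & 3 yields the state. */
    static const char *state_name[] = {
        "READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
    };

    int
    main(void)
    {
        unsigned int counter = 0;   /* a fresh cluster starts at 0 (READ_WRITE) */
        unsigned int target;

        /* Backend side of pg_prohibit_wal(true): one increment, then compute target. */
        counter++;                              /* -> GOING_READ_ONLY */
        target = counter + 1;                   /* the backend waits for this value */
        printf("after backend increment:      counter=%u state=%s\n",
               counter, state_name[counter & 3]);

        /* Checkpointer side: barrier absorbed, second increment reaches the target. */
        counter++;                              /* -> READ_ONLY */
        printf("after checkpointer increment: counter=%u state=%s (target=%u)\n",
               counter, state_name[counter & 3], target);

        /* pg_prohibit_wal(false) repeats the handshake, wrapping back to READ_WRITE. */
        counter++;                              /* -> GOING_READ_WRITE */
        counter++;                              /* -> READ_WRITE */
        printf("after the reverse handshake:  counter=%u state=%s\n",
               counter, state_name[counter & 3]);
        return 0;
    }

Because the counter only ever increases, a waiting backend can simply sleep on
the condition variable until the counter reaches its target value.
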
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6395a9b2408..2fcec7b4ce6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1988,23 +1988,28 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records,
+	 * either because we are still in recovery or because ALTER SYSTEM READ
+	 * ONLY has been executed, force this to be a read-only transaction.
+	 * We have lower level defences in XLogBeginInsert() and elsewhere to stop
+	 * us from modifying data during recovery when !XLogInsertAllowed(), but
+	 * this gives the normal indication to the user that the transaction is
+	 * read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to
+	 * decide whether to permit (1) relying on existing killed-tuple markings
+	 * and (2) further killing of index tuples. Even when WAL is prohibited
+	 * on the master, it's still the master, so the former is OK; and since
+	 * killing index tuples doesn't generate WAL, the latter is also OK.
+	 * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f4d1ce5deae..e3e2642323f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * xlogAllowWritesDone indicates whether the end-of-recovery checkpoint and
+	 * the WAL writes required to start the server normally have been performed.
+	 */
+	bool		xlogAllowWritesDone;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6247,6 +6255,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the value of the xlogAllowWritesDone flag.
+ */
+bool
+XLogWriteAllowedIsDone(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->xlogAllowWritesDone;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6462,8 +6480,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6612,13 +6630,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7854,16 +7881,133 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will
+	 * be written later in XLogAcceptWrites.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL insertion should be
+	 * allowed or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in recovery: shutting down in the WAL prohibited
+		 * state skips the shutdown checkpoint, which forces recovery on restart.
+		 */
+		Assert(needChkpt);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary to
+ * start the server normally.  These operations are skipped at startup if the
+ * system was started in the WAL prohibited state; in that case they are
+ * performed by the checkpointer while changing the system to WAL permitted.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only the startup process, the checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (XLogWriteAllowedIsDone())
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, these operations cannot
+	 * be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7881,15 +8025,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7920,6 +8069,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7999,63 +8150,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8084,27 +8185,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->xlogAllowWritesDone = true;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8319,9 +8425,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8340,9 +8446,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8364,6 +8481,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8653,9 +8776,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is performed during recovery; a checkpoint and xlog
+	 * rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8668,6 +8795,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65dc7bb..92ed3d1d84f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1539,6 +1539,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there is
+		 * a worker slot available.  Third, we need to make sure that no other
+		 * worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5907a7befc5..441ffcbf891 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,17 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b1e2d94951d..70cd43619ff 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4312,6 +4312,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up wal prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up wal prohibit change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a wal prohibit state change request here as well,
+				 * for the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd2..f495dbedc20 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2058,6 +2060,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12226,4 +12240,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..1fe7dde0504
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertion above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while the system state is changed to WAL
+ * prohibited.  Therefore, we need to error out explicitly before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6d384d3ce6d..ab63cc998f7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -315,6 +315,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +324,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool XLogWriteAllowedIsDone(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +336,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* True if WAL writes are prohibited */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 93393fcfd4f..82db3ae7063 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11399,6 +11399,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c048028..771de8135b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1079,6 +1079,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61cf4eae1f2..b054674f04f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2704,6 +2704,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
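
To make the counter encoding described in walprohibit.h easier to follow,
here is a minimal standalone sketch (not part of the patch; state_of() and
the main() driver exist only for illustration) of how a counter that only
ever advances by 1 maps onto the four states through its low two bits:

    #include <stdio.h>

    typedef enum
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,       /* WAL permitted */
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
        WALPROHIBIT_STATE_READ_ONLY = 2,        /* WAL prohibited */
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3
    } WALProhibitState;

    /* Same mapping as GetWALProhibitState(): the low two bits give the state. */
    static WALProhibitState
    state_of(unsigned counter)
    {
        return (WALProhibitState) (counter & 3);
    }

    int
    main(void)
    {
        /*
         * The shared counter only ever advances by 1; starting from 0
         * (read-write) the states cycle RW -> GOING_RO -> RO -> GOING_RW
         * and back to RW.
         */
        for (unsigned counter = 0; counter <= 7; counter++)
            printf("counter %u -> state %d\n", counter, (int) state_of(counter));
        return 0;
    }

Running it prints the cycle 0, 1, 2, 3, 0, ... which is why a state change
request only ever needs to increment the counter, and why the initial value
at postmaster startup is either 0 or 2.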

Attachment: v20-0003-WIP-Documentation.patch (application/x-patch)
From 6255effea219e66c1e8d3e1fa270eaaf151fc64d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v20 3/3] WIP - Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 33 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9492a3c6b92..215266af71b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24744,9 +24744,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24832,6 +24832,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept the state change immediately. When
+        <literal>true</literal> is passed, the system is changed to read-only
+        (the WAL prohibited state), if it is not already. When
+        <literal>false</literal> is passed, the system is changed to
+        read-write (the WAL permitted state), if it is not already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..4e3d39fe94d 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,37 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which inserting write-ahead log is prohibited until the
+    same function is executed to change the state back to read-write. As in Hot
+    Standby, connections to the server are still allowed to run read-only
+    queries while in the WAL prohibited state.  While the system only allows
+    read-only queries, the <literal>wal_prohibited</literal> GUC reports
+    <literal>on</literal>; otherwise it reports <literal>off</literal>.  When a
+    user requests the WAL prohibited state, any existing session that is
+    running a transaction which has already performed, or is planning to
+    perform, WAL writes is terminated.  This is useful for HA setups where the
+    master server needs to stop accepting WAL writes immediately and kick out
+    any transaction expecting to write WAL at the end, in case the network
+    goes down on the master or replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint, and at
+    restart the server goes into crash recovery and stays in that state until
+    the system is changed back to read-write. If a read-only server finds a
+    <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, it implicitly gets
+    out of the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..fe7a84f93da 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into the WAL prohibited state by ALTER SYSTEM READ ONLY.
+We have a lower-level defense in XLogBeginInsert() and elsewhere to stop us
+from modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is
+called inside a critical section we must not depend on it to report an error,
+because an error there causes a PANIC, as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but ALTER
+SYSTEM READ ONLY can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+needs to stop writing WAL immediately.  To absorb the barrier, a backend
+kills its running transaction if it has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning to perform, WAL
+writes.  Transactions that have not acquired a valid XID, and operations such
+as VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have a valid
+XID when writing WAL, are not stopped by barrier processing, and those might
+hit an error from XLogBeginInsert() when trying to write WAL in the read-only
+state.  To keep that error from being raised inside a critical section, WAL
+write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assert-only flag that records
+whether permission was checked before calling XLogBeginInsert().  If it was
+not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is not inside a critical section, where
+throwing an error is acceptable.  To set the permission-checked flag, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically
+reset on exiting the critical section.  The rules for placing these
+permission check routines are:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before the critical section is entered.
+
+	Places where INSERTs and UPDATEs are expected, which never happen without
+	a valid XID, can be checked with AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where assert-enabled builds should still verify
+	that the permission check was done, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+	Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+	permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while in a read-only state (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in that case.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
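
For orientation before the per-call-site changes in the 0002 patch below,
this is roughly the shape that the rule documented in the transam README
hunk above gives to a WAL-writing call site.  It is a sketch only, not code
from the patch: do_page_change(), RM_FOO_ID and XLOG_FOO_CHANGE are invented
placeholders, while the permission-check and WAL-insert calls are the real
ones.

    /*
     * Sketch only; assumes the usual backend headers (access/xloginsert.h,
     * storage/bufmgr.h, access/walprohibit.h, ...) and a made-up rmgr.
     */
    static void
    foo_change_page(Relation rel, Buffer buffer)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /*
         * Reachable without an XID (as in VACUUM), so report the read-only
         * ERROR here, before the critical section.  Call sites that always
         * have an XID would use AssertWALPermittedHaveXID() instead, which
         * costs nothing in non-assert builds.
         */
        if (needwal)
            CheckWALPermitted();

        START_CRIT_SECTION();

        do_page_change(BufferGetPage(buffer));  /* modify the shared buffer */
        MarkBufferDirty(buffer);

        if (needwal)
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();      /* asserts the permission check was done */
            XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
            recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_CHANGE);
            PageSetLSN(BufferGetPage(buffer), recptr);
        }

        END_CRIT_SECTION();
    }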

Attachment: v20-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From b725f3440de2c273027cf8c38feadd865db4e4f2 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v20 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Added an Assert or an ERROR, based on the following criteria, for the case
where the system is WAL prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common static
   inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert checks XID validity, via AssertWALPermitted_HaveXID().

To enforce the rule of having the aforesaid assert or error check before
entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added. If the check is missing, XLogBeginInsert()
fails an assertion when it is called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above
checking is not necessary; we can rely on XLogBeginInsert() itself to make the
check and report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 40 files changed, 460 insertions(+), 70 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3fb231adf45 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,9 +474,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..f251d6fc388 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,10 +577,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d0..661e88da372 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2104,6 +2105,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2427,6 +2430,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2987,6 +2992,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3738,6 +3745,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3911,6 +3920,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4843,6 +4854,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5633,6 +5646,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5791,6 +5806,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5899,6 +5916,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -6019,6 +6038,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6049,6 +6069,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6059,7 +6083,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 366c122bd1e..cb86d9e85e4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1487,6 +1492,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1504,7 +1512,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1934,6 +1942,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1941,6 +1950,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1966,7 +1978,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never has a WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0bc86943ebc..9ee96e2c413 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index fc744cf9fdb..9a09850640c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16fb..31404dfdb70 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2195,6 +2198,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2293,6 +2299,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 824a19d62e0..5dae4667975 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission must be checked
+ * before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2fcec7b4ce6..6fe57a4147b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1343,6 +1344,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1703,6 +1706,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3e2642323f..633ef40c026 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state should not restrict WAL flushing; a dirty buffer cannot
+	 * be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9049,6 +9051,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9300,6 +9304,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9457,6 +9464,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10109,7 +10118,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10123,10 +10132,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10148,8 +10157,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* We are assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL prohibited error would force a system panic.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in the
+		 * WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 441ffcbf891..c2473177ff8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery, but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c93..90b32afc04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3826,13 +3826,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 013850ac288..925c1a7a5bd 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
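
Taken together, the hunks above apply a single pattern: compute whether the
relation needs WAL once, check the WAL permission while ereport(ERROR) is
still an ordinary error, and only then enter the critical section.  Below is
a minimal sketch of that pattern; my_log_page_change() and its arguments are
invented for illustration and are not code from the patch, while the
CheckWALPermitted()/AssertWALPermittedHaveXID() calls are the ones used
throughout the hunks (presumably declared in access/walprohibit.h, which each
touched file now includes).

#include "postgres.h"

#include "access/walprohibit.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Illustrative only: the shape of a WAL-writing operation under this patch */
static void
my_log_page_change(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Check permission before the critical section, while ereport(ERROR) is
	 * still safe.  Callers that are guaranteed to hold an XID would use
	 * AssertWALPermittedHaveXID() instead.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... scribble on the page ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		/* ... XLogBeginInsert(), XLogRegisterBuffer(), XLogInsert() ... */
	}

	END_CRIT_SECTION();
}

Raising the WAL-prohibited error before START_CRIT_SECTION() keeps it an
ordinary ERROR; inside the critical section the same ereport would be
promoted to PANIC, which is what the walpermit_checked_state assertion added
to XLogBeginInsert() is meant to catch.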

#101Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#100)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

I made a few changes and fixes in the attached version.
The document patch is now ready for review.

Regards,
Amul


On Mon, Mar 15, 2021 at 12:55 PM Amul Sul <sulamul@gmail.com> wrote:

On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

On Tue, Mar 9, 2021 at 3:31 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[....]

One of the patches (v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the latest patchset does not apply successfully.

http://cfbot.cputube.org/patch_32_2602.log

=== applying patch ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch

Hunk #15 succeeded at 2604 (offset -13 lines).
1 out of 15 hunks FAILED -- saving rejects to file src/backend/access/nbtree/nbtpage.c.rej
patching file src/backend/access/spgist/spgdoinsert.c

It is a very minor change, so I rebased the patch. Please take a look and see if that works for you.

Thanks, I am getting one more failure for vacuumlazy.c on the latest
master head (d75288fb27b); I fixed that in the attached version.

Regards,
Amul

Attachments:

v21-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From d635bfc907f0d81f36294fef82db5b55a59706f5 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v21 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by
    calling the pg_prohibit_wal(true) SQL function, the current state is
    marked in-progress in shared memory and the checkpointer process is
    signaled.  The checkpointer, noticing the state transition, emits the
    barrier request and then acknowledges back to the backend that
    requested the state change once the transition has completed.  The
    final state is updated in the control file to make it persistent
    across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in
    a transaction and the transaction has already been assigned an XID, then
    the backend will be killed by throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up.  E.g. a
    backend might later request that the system be put back to read-write.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    needs some discussion as well), but the end-of-recovery checkpoint, the
    necessary WAL writes, and the control file update needed to start the
    server normally will be skipped; they will be performed when the system
    is changed to WAL-Permitted mode.  Until then "Database cluster state"
    will be "in crash recovery".

 7. Altering WAL-Prohibited mode is restricted on a standby server, except
    in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile will implicitly and permanently pull
    the server out of the read-only (WAL prohibited) state.

 9. Add a wal_prohibited GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery (see the libpq sketch
    after this list).
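
A quick way to exercise this end to end is a small libpq client that calls
the new function and then reads the new GUC back.  The program below is an
illustrative sketch only; the connection string and error handling are
arbitrary, and nothing in it is part of the patch itself.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/*
	 * Request the WAL-prohibited state; this waits until the checkpointer
	 * has completed the transition.
	 */
	res = PQexec(conn, "SELECT pg_prohibit_wal(true)");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "pg_prohibit_wal failed: %s", PQerrorMessage(conn));
	PQclear(res);

	/* The GUC from item 9 should now report the prohibited state. */
	res = PQexec(conn, "SHOW wal_prohibited");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("wal_prohibited = %s\n", PQgetvalue(res, 0, 0));
	PQclear(res);

	/* Put the system back to read-write. */
	res = PQexec(conn, "SELECT pg_prohibit_wal(false)");
	PQclear(res);

	PQfinish(conn);
	return 0;
}

Build it against libpq in the usual way, e.g.
cc prohibit_wal.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq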
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 438 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 292 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 ++
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |   6 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 893 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..824a19d62e0
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL
+		 * prohibit state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we could
+		 * not simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  In the startup process, we skip the
+	 * end-of-recovery checkpoint and the related WAL write operations while
+	 * booting a read-only (WAL prohibited) server; these should be completed
+	 * before changing the system state to read-write.  To disallow any other
+	 * backend from writing a WAL record before the end-of-crash-recovery
+	 * checkpoint finishes, we leave the server in recovery mode.
+	 */
+	if (XLogWriteAllowedIsDone())
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * It is not a final state since we have yet to convey this WAL prohibit
+	 * state to all backends.  The checkpointer will do that and update the
+	 * shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been set, it needs to be
+	 * completed.  If the server crashes before the state transition
+	 * completes, the control file information will be used to set the final
+	 * WAL prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * There won't be any other process doing the final state transition, so
+	 * the shared WAL prohibit state counter shouldn't have been changed by
+	 * now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * The checkpointer will complete the WAL prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * make sure it has processed all pending WAL prohibit state change
+	 * requests as soon as possible.  Since CreateCheckPoint and
+	 * ProcessSyncRequests sometimes run in non-checkpointer processes, do
+	 * nothing if we are not the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL write operations that the startup process needs to
+				 * perform to start the server normally were skipped; if so,
+				 * do them right away now.  While doing that, hold off state
+				 * transitions to avoid a recursive call to process the WAL
+				 * prohibit state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6395a9b2408..b03c30413f9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1988,23 +1988,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() has been
+	 * executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f4d1ce5deae..e3e2642323f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * xlogAllowWritesDone indicates whether the end-of-recovery checkpoint
+	 * and the WAL writes required to start the server normally have already
+	 * been performed.
+	 */
+	bool		xlogAllowWritesDone;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -6247,6 +6255,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of xlogAllowWritesDone flag.
+ */
+bool
+XLogWriteAllowedIsDone(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->xlogAllowWritesDone;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6462,8 +6480,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6612,13 +6630,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7854,16 +7881,133 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later, in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have gone through recovery: a shutdown in the WAL prohibited
+		 * state skips the shutdown checkpoint, which forces recovery on
+		 * restart.
+		 */
+		Assert(needChkpt);
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes that are
+ * necessary before the server can start up normally.  These operations are
+ * skipped at startup if the system is started in the WAL prohibited state;
+ * in that case the checkpointer performs them later, when the system is
+ * changed back to the WAL permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only startup, checkpointer, or a standalone backend is allowed here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (XLogWriteAllowedIsDone())
+		return;
+
+	/*
+	 * These operations cannot be performed while the system is in the WAL
+	 * prohibited state.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7881,15 +8025,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7920,6 +8069,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7999,63 +8150,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8084,27 +8185,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->xlogAllowWritesDone = true;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8319,9 +8425,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8340,9 +8446,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8364,6 +8481,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8653,9 +8776,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is created while in recovery; otherwise, the shutdown
+	 * checkpoint and xlog rotation are performed only if WAL writing is
+	 * permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8668,6 +8795,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65dc7bb..92ed3d1d84f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1539,6 +1539,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5907a7befc5..33f55ca8bba 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b1e2d94951d..70cd43619ff 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4312,6 +4312,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * Don't hold up WAL prohibit state change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd2..f495dbedc20 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -227,6 +227,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -618,6 +619,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2058,6 +2060,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12226,4 +12240,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean GUC.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..09b54710b60
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertion above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) is not killed while the system state is changed to WAL
+ * prohibited.  Therefore, we need to error out explicitly before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6d384d3ce6d..ab63cc998f7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -315,6 +315,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +324,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool XLogWriteAllowedIsDone(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +336,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are currently prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 93393fcfd4f..82db3ae7063 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11399,6 +11399,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c048028..771de8135b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1079,6 +1079,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61cf4eae1f2..b054674f04f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2704,6 +2704,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

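Before the second patch, one note on the state machine that walprohibit.h above
introduces: because the shared counter only ever advances by one, the current
state is simply its low two bits, which is what GetWALProhibitState() relies
on.  A minimal standalone sketch follows; the demo_* names are illustrative
only and are not part of the patch.

    #include <stdio.h>

    /* Mirrors the four states in walprohibit.h; the state is the counter's low two bits. */
    static const char *const demo_state_names[] = {
        "READ_WRITE",          /* 0: WAL permitted */
        "GOING_READ_ONLY",     /* 1 */
        "READ_ONLY",           /* 2: WAL prohibited */
        "GOING_READ_WRITE"     /* 3 */
    };

    int
    main(void)
    {
        /* Starts at 0 if the control file says read-write, at 2 if read-only. */
        unsigned int counter = 0;
        int         i;

        /* Every requested and completed transition only ever adds 1. */
        for (i = 0; i < 5; i++)
        {
            printf("counter = %u -> %s\n", counter, demo_state_names[counter & 3]);
            counter++;
        }
        return 0;
    }

This is also why CompleteWALProhibitChange() can simply assert that the counter
did not move underneath it: no other process performs the final transition.
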
Attachment: v21-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From e6ee04e063acebe73740973538a17d0744726f16 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v21 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

When the system is WAL prohibited, an Assert or an ERROR is added before each
WAL-writing critical section, based on the following criteria:

 - An ERROR is raised in functions that can be reached without a valid XID,
   e.g. from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - An Assert is used in functions that cannot be reached without a valid XID;
   the assertion also verifies XID validity.  For that,
   AssertWALPermitted_HaveXID() is added.

To enforce the rule that one of these checks precedes any critical section that
writes WAL, a new assert-only flag walpermit_checked_state is added.  If the
check is missing, XLogBeginInsert() asserts when it is called inside a critical
section.

If the WAL insert is not done inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() to perform the check and report an
error.
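
For illustration only, a call site following this rule might look like the
sketch below.  The routine name demo_modify_page and the elided page
manipulation are placeholders, not code from this patch; the call sites
actually touched by the patch (for example brin_doinsert) follow the same
shape, and the sketch assumes the backend environment plus the
CheckWALPermitted() helper added in the previous patch.

    /* Sketch only: the intended check-before-critical-section pattern. */
    static void
    demo_modify_page(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /*
         * This path can be reached without an assigned XID, so raise an ERROR
         * (rather than an Assert) before the critical section when the system
         * is WAL prohibited.
         */
        if (needwal)
            CheckWALPermitted();

        START_CRIT_SECTION();

        /* ... modify the page and MarkBufferDirty(buf) ... */

        if (needwal)
        {
            /* XLogBeginInsert() asserts that the check above was performed. */
            XLogBeginInsert();
            /* ... XLogRegisterBuffer(), XLogInsert(), PageSetLSN() ... */
        }

        END_CRIT_SECTION();
    }

Where the call site can only be reached with an XID already assigned, the
patch uses AssertWALPermittedHaveXID() in the same position instead of
CheckWALPermitted().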
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  6 ++--
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 41 files changed, 463 insertions(+), 73 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d0..661e88da372 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2104,6 +2105,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2427,6 +2430,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2987,6 +2992,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3738,6 +3745,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3911,6 +3920,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4843,6 +4854,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5633,6 +5646,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5791,6 +5806,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5899,6 +5916,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -6019,6 +6038,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6049,6 +6069,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6059,7 +6083,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 366c122bd1e..cb86d9e85e4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -759,6 +760,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1201,6 +1203,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1216,7 +1221,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1487,6 +1492,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1504,7 +1512,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1934,6 +1942,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1941,6 +1950,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1966,7 +1978,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process never runs in the WAL prohibit state,
+	 * so skip the permission check when we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0bc86943ebc..9ee96e2c413 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index fc744cf9fdb..9a09850640c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16fb..31404dfdb70 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2195,6 +2198,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2293,6 +2299,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 824a19d62e0..5dae4667975 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag used to enforce the rule that WAL insert permission must be
+ * checked before starting a critical section that writes WAL.  For this, one
+ * of CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b03c30413f9..c54f67da3c7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1343,6 +1344,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1703,6 +1706,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3e2642323f..633ef40c026 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not block WAL flushing; otherwise a dirty buffer
+	 * could never be evicted, because a buffer cannot be written out until
+	 * WAL has been flushed up to its LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9049,6 +9051,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9300,6 +9304,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9457,6 +9464,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10109,7 +10118,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10123,10 +10132,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10148,8 +10157,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assert that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error raised here would escalate
+	 * to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in the
+		 * WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 33f55ca8bba..56636b51df9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c93..90b32afc04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3826,13 +3826,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one (recovery or the WAL prohibit
+			 * state), don't dirty the page.  We can set the hint, but must
+			 * not dirty the page as a result, lest we trigger WAL generation.
+			 * Unless the page is dirtied again later, the hint will be lost
+			 * when the page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 09b54710b60..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -70,9 +70,9 @@ AssertWALPermitted(void)
 }
 
 /*
- * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
- * part of the code that can only be reached with an XID assigned is never
- * reached when WAL is prohibited.
+ * XID-bearing transactions are killed off by executing the pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
  */
 static inline void
 AssertWALPermittedHaveXID(void)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 013850ac288..925c1a7a5bd 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section; otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

Attachment: v21-0003-Documentation.patch (application/x-patch)
From 7a87f34538a3f71a7c3fa032fbfe96bf31d024ca Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v21 3/3] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9492a3c6b92..215266af71b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24744,9 +24744,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24832,6 +24832,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept the state change immediately.  Passing
+        <literal>true</literal> changes the system state to read-only (the
+        WAL prohibited state), if it is not read-only already; passing
+        <literal>false</literal> changes it back to read-write (the WAL
+        permitted state), if it is not read-write already.  See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..ecda6b24a9c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any user with the required
+    permission can call the <function>pg_prohibit_wal</function> function to
+    force the system into a read-only mode in which inserting write-ahead log
+    is prohibited until the same function is executed again to change the
+    state back to read-write.  As in Hot Standby, connections to the server
+    can still run read-only queries while in the WAL prohibited state.  While
+    only read-only queries are allowed, the GUC
+    <literal>wal_prohibited</literal> reports <literal>on</literal>; otherwise
+    it reports <literal>off</literal>.  When a user requests the WAL
+    prohibited state, any session whose current transaction has already
+    performed, or is expected to perform, WAL writes is terminated.  This is
+    useful in HA setups where the master server needs to stop accepting WAL
+    writes immediately and kick out any transaction that would write WAL at
+    commit, for example when the master loses network connectivity or its
+    replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; at the
+    next start, the server goes into crash recovery and stays in that state
+    until the system is changed back to read-write.  If a read-only server
+    finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system
+    implicitly leaves the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere that stops us from modifying data when !XLogInsertAllowed(), but
+if XLogBeginInsert() is reached inside a critical section we must not depend on
+it to report the error, because any error raised there causes a PANIC, as
+mentioned previously.
+
+We do not reach the point where we try to write WAL during recovery but
+pg_prohibit_wal() can be executed anytime by the user to stop WAL writing.  Any
+backends which receive read-only system state transition barrier interrupt need
+to stop WAL writing immediately.  For barrier absorption the backed(s) will kill
+the running transaction which has valid XID indicates that the transaction has
+performed and/or planning WAL write.  The transaction which doesn't acquire
+valid XID yet or operation such VACUUM or CONCURRENT CREATE INDEX which not
+necessary have valid XID for WAL will not be prevented while barrier processing,
+and those might hit the error from XLogBeginInsert() while trying to write WAL
+in read only system state.  To prevent such error from XLogBeginInsert() inside
+the critical section the WAL write permission has to check before
+START_CRIT_SECTION().
+
+To enforce the practice to check WAL permission before entering into critical
+section for the WAL write, we have added an assert check flag that indicates
+permission has been checked before calling XLogBeginInsert().  If not,
+XLogBeginInsert() will have assertion failure.  WAL permission check is not
+mandatory if the XLogBeginInsert() is not inside the critical section where
+throwing the error is acceptable.  To get permission check flag set either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  This flag automatically resets
+while exiting from the critical section.  The rule to place either of permission
+check routines will be:
+
+	The places where WAL write operation in critical can be expected without
+	having valid XID (e.g vacuum) need to protect by CheckWALPermitted(), so
+	that error can be reported outside before critical section.
+
+	The places where INSERT and UPDATE are expected which are never happened
+	without valid XID can be checked using AssertWALPermitted_HaveXID.  So that
+	non-assert build will not have the checking overhead.
+
+	The places we know that we cannot be reached in the read-only state and may
+	or may not have XID, but need to ensure the permission has been checked on
+	assert enabled build should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set in those states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

#102Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Amul Sul (#100)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi all,
While testing this feature with the v20 patch, the server crashes with the
steps below.

Steps to reproduce:
1. Configure master-slave replication setup.
2. Connect to Slave.
3. Execute below statements, it will crash the server:
SELECT pg_prohibit_wal(true);
SELECT pg_prohibit_wal(false);

-- Slave:
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)

postgres=# SELECT pg_prohibit_wal(true);
pg_prohibit_wal
-----------------

(1 row)

postgres=# SELECT pg_prohibit_wal(false);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

-- Below are the stack trace:
[prabhat@localhost bin]$ gdb -q -c /tmp/data_slave/core.35273 postgres
Reading symbols from
/home/prabhat/PG/PGsrcNew/postgresql/inst/bin/postgres...done.
[New LWP 35273]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer
'.
Program terminated with signal 6, Aborted.
#0 0x00007fa876233387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-317.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64
libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00007fa876233387 in raise () from /lib64/libc.so.6
#1 0x00007fa876234a78 in abort () from /lib64/libc.so.6
#2 0x0000000000aea31c in ExceptionalCondition (conditionName=0xb8c998
"ThisTimeLineID != 0 || IsBootstrapProcessingMode()",
errorType=0xb8956d "FailedAssertion", fileName=0xb897c0 "xlog.c",
lineNumber=8611) at assert.c:69
#3 0x0000000000588eb5 in InitXLOGAccess () at xlog.c:8611
#4 0x0000000000588ae6 in LocalSetXLogInsertAllowed () at xlog.c:8483
#5 0x00000000005881bb in XLogAcceptWrites (needChkpt=true, xlogreader=0x0,
EndOfLog=0, EndOfLogTLI=0) at xlog.c:8008
#6 0x00000000005751ed in ProcessWALProhibitStateChangeRequest () at
walprohibit.c:361
#7 0x000000000088c69f in CheckpointerMain () at checkpointer.c:355
#8 0x000000000059d7db in AuxiliaryProcessMain (argc=2,
argv=0x7ffd1290d060) at bootstrap.c:455
#9 0x000000000089fc5f in StartChildProcess (type=CheckpointerProcess) at
postmaster.c:5416
#10 0x000000000089f782 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5128
#11 <signal handler called>
#12 0x00007fa8762f2983 in __select_nocancel () from /lib64/libc.so.6
#13 0x000000000089b511 in ServerLoop () at postmaster.c:1700
#14 0x000000000089af00 in PostmasterMain (argc=5, argv=0x15b8460) at
postmaster.c:1408
#15 0x000000000079c23a in main (argc=5, argv=0x15b8460) at main.c:209
(gdb)

kindly let me know if you need more inputs on this.

On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:

On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed <ibrar.ahmad@gmail.com>
wrote:

On Tue, Mar 9, 2021 at 3:31 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com>

wrote:

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com>

wrote:

[....]

One of the patch

(v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the
latest patchset does not apply successfully.

http://cfbot.cputube.org/patch_32_2602.log

=== applying patch

./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch

Hunk #15 succeeded at 2604 (offset -13 lines).
1 out of 15 hunks FAILED -- saving rejects to file

src/backend/access/nbtree/nbtpage.c.rej

patching file src/backend/access/spgist/spgdoinsert.c

It is a very minor change, so I rebased the patch. Please take a look,

if that works for you.

Thanks, I am getting one more failure for the vacuumlazy.c. on the
latest master head(d75288fb27b), I fixed that in attached version.

Regards,
Amul

--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

#103Amul Sul
sulamul@gmail.com
In reply to: Prabhat Sahu (#102)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Mar 19, 2021 at 7:17 PM Prabhat Sahu
<prabhat.sahu@enterprisedb.com> wrote:

Hi all,
While testing this feature with the v20 patch, the server crashes with the steps below.

Steps to reproduce:
1. Configure master-slave replication setup.
2. Connect to Slave.
3. Execute below statements, it will crash the server:
SELECT pg_prohibit_wal(true);
SELECT pg_prohibit_wal(false);

-- Slave:
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)

postgres=# SELECT pg_prohibit_wal(true);
pg_prohibit_wal
-----------------

(1 row)

postgres=# SELECT pg_prohibit_wal(false);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

Thanks Prabhat.

The assertion failure is due to a wrong assumption about the flag that was used
for the XLogAcceptWrites() call. In the case of a standby, the startup process
never reaches the place where it could call XLogAcceptWrites() and update the
respective flag. Because of that flag value, the pg_prohibit_wal() function was
able to alter the system state while in recovery, which is incorrect.

In the attached version I changed that flag to an enum, so that
pg_prohibit_wal() is allowed in recovery only if the flag indicates that
XLogAcceptWrites() was previously skipped.
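
For reference, the relevant guard from the attached v22 patch, with a
summarizing comment added, is:

    /*
     * Disallow WAL prohibit state changes during recovery, unless the startup
     * process recorded that XLogAcceptWrites() was skipped (the crash
     * recovery case on a WAL-prohibited server).
     */
    if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_SKIPPED)
        PreventCommandDuringRecovery("pg_prohibit_wal()");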

Regards,
Amul

Attachments:

v22-0001-Implement-wal-prohibit-state-using-global-barrie.patchapplication/x-patch; name=v22-0001-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From 4c5ad0231e5b38cb878ea6844c151782c2b133c3 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v22 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the current state is marked as
    in progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the state transition, emits the barrier
    request, and then acknowledges back to the backend that requested the
    state change once the transition has been completed.  The final state
    will be updated in the control file to make it persistent across system
    restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then
    the backend will be killed by throwing FATAL (XXX: need more discussion
    on this).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    we don't need to do anything special right now; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up.  E.g. a
    backend might later on request us to put the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation. Starting up again will perform crash recovery (XXX:
    need some discussion on this as well), but the end-of-recovery
    checkpoint, the necessary WAL writes, and the control file update needed
    to start the server normally will be skipped; they will be performed when
    the system is changed to WAL-Permitted mode. Until then "Database cluster
    state" will be "in crash recovery".

 7. Altering WAL-Prohibited mode is restricted on a standby server, except in
    the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile will implicitly pull the server out of
    the read-only (WAL prohibited) state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 438 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 294 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 ++
 src/backend/postmaster/pgstat.c          |   6 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/pgstat.h                     |   2 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 903 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..99359a98cba
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should only get here while transitioning towards the WAL
+		 * prohibited state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which still need more thought:
+		 *
+		 * 1. Because of challenges with the wire protocol, we cannot simply
+		 * kill off an idle transaction.
+		 *
+		 * 2. If we are in a subtransaction, then ERROR will kill only the
+		 * current subtransaction.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  The startup process skips the end-of-recovery
+	 * checkpoint and the related WAL writes when booting a read-only (WAL
+	 * prohibited) server, and those must be completed before the system state
+	 * can be changed to read-write.  To disallow any other backend from
+	 * writing a WAL record before the end-of-crash-recovery checkpoint
+	 * finishes, we leave the server in recovery mode.
+	 */
+	if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_SKIPPED)
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, we must be in a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We complete the
+	 * transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared
+	 * WAL prohibit state counter should not have been changed by anyone else
+	 * by now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * make sure it processes all pending WAL prohibit state change requests
+	 * as soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if we are not
+	 * the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, then
+				 * the WAL writes the startup process would normally perform
+				 * to start the server have been skipped; if so, do them right
+				 * away now.  While doing that, hold off state transitions to
+				 * avoid a recursive call to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let Checkpointer process do anything until
+					 * someone wakes it up.  For example a backend might later
+					 * on request us to put the system back to read-write
+					 * state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6395a9b2408..b03c30413f9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1988,23 +1988,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e1490..66b6c3973f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and WAL writes needed to start the server normally are done.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5224,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6247,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the value of SharedXLogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6462,8 +6481,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6612,13 +6631,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7854,16 +7882,134 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will
+	 * be written later in XLogAcceptWrites.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize WAL prohibit state in shared
+	 * memory that will decide the further WAL insert should be allowed or
+	 * not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We do start in recovery, since shutting down in the WAL prohibited
+		 * state skips the shutdown checkpoint, which forces recovery on restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary to
+ * start the server normally.  These operations are skipped in startup if the
+ * system is started in the WAL prohibited state, and are then performed by the
+ * checkpointer when the system is changed to the WAL permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only startup, checkpointer, or a standalone backend is allowed here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, then these operations
+	 * cannot be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7881,15 +8027,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7920,6 +8071,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7999,63 +8152,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8084,27 +8187,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8319,9 +8427,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8340,9 +8448,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8364,6 +8483,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8653,9 +8778,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The shutdown checkpoint and xlog rotation are skipped if WAL writing is
+	 * not permitted; during recovery, a restartpoint is created instead.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8668,6 +8797,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65dc7bb..92ed3d1d84f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1539,6 +1539,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5907a7befc5..33f55ca8bba 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 208a33692f3..f9f0ae1e031 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4312,6 +4312,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_REPLICATION_SLOT_WRITE:
 			event_name = "ReplicationSlotWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 		case WAIT_EVENT_SLRU_FLUSH_SYNC:
 			event_name = "SLRUFlushSync";
 			break;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold off WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * Don't hold off WAL prohibit state change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reasons mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf45..582f99609d9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3b36a31a475..60ccbe4e126 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -228,6 +228,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -627,6 +628,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2086,6 +2088,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12407,4 +12421,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..09b54710b60
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while the system state is being changed to
+ * WAL prohibited.  Therefore, we need to error out explicitly before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..d91427e0905 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of the work that enables WAL writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are prohibited (system is read-only). */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e259531f60b..5d0cadbf2e9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11403,6 +11403,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c048028..771de8135b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1079,6 +1079,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_SYNC,
 	WAIT_EVENT_REPLICATION_SLOT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE,
 	WAIT_EVENT_SLRU_FLUSH_SYNC,
 	WAIT_EVENT_SLRU_READ,
 	WAIT_EVENT_SLRU_SYNC,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2f0e5..03fbab79ea4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2707,6 +2707,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
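
As a quick standalone illustration of the state encoding described in
walprohibit.h above (this is not part of the patch; the local counter below is
just a stand-in for the shared-memory counter the patch uses), the low two
bits of the counter give the state, so every legal transition is simply an
increment:

#include <stdint.h>
#include <stdio.h>

typedef enum
{
	WALPROHIBIT_STATE_READ_WRITE = 0,
	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
	WALPROHIBIT_STATE_READ_ONLY = 2,
	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
} WALProhibitState;

static const char *const state_names[] = {
	"READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
};

int
main(void)
{
	uint32_t	counter = 0;	/* 0 or 2 at startup, per the control file */
	int			i;

	/* Each state change request bumps the counter by one. */
	for (i = 0; i < 6; i++, counter++)
		printf("counter=%u  state=%s\n", (unsigned) counter,
			   state_names[counter & 3]);

	return 0;
}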

v22-0003-Documentation.patch (application/x-patch)
From 110e80a1bd587c1daa3dc7b4ec69b5f7cbbf8439 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v22 3/3] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 68fe6a95b49..24ea744f4a7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24744,9 +24744,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24832,6 +24832,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state of the system and forces all
+        processes of the <productname>PostgreSQL</productname> server to
+        accept the state change immediately.  When <literal>true</literal>
+        is passed, the system is changed to read-only (the WAL prohibited
+        state) if it is not in that state already.  When
+        <literal>false</literal> is passed, the system is changed to
+        read-write (the WAL permitted state) if it is not in that state
+        already.  See <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c01081..ecda6b24a9c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,4 +2329,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which inserting write-ahead log records is prohibited
+    until the same function is executed again to change the state back to
+    read-write.  As in Hot Standby, connections to the server may still run
+    read-only queries while the system is in the WAL prohibited state.  While
+    only read-only queries are allowed, the GUC
+    <literal>wal_prohibited</literal> reports <literal>on</literal>;
+    otherwise it reports <literal>off</literal>.  When the WAL prohibited
+    state is requested, any session whose current transaction has already
+    performed WAL writes, or is expected to perform them, is terminated.
+    This is useful in HA setups where the master server needs to stop
+    accepting WAL writes immediately and kick out any transaction expecting
+    to write WAL at the end, for example when the network goes down on the
+    master or replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint, and on
+    restart the server goes into crash recovery and stays in that state until
+    the system is changed back to read-write.  However, if a
+    <filename>standby.signal</filename> or <filename>recovery.signal</filename>
+    file is found at startup, the system implicitly leaves the read-only
+    state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because it is still in recovery or because it
+has been forced into the WAL prohibited state with the pg_prohibit_wal()
+function.  We have a lower-level defense in XLogBeginInsert() and elsewhere
+that stops us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because that would cause a PANIC, as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed by the user at any time to stop WAL writing.
+Any backend that receives the read-only state transition barrier must stop
+writing WAL immediately.  To absorb the barrier, a backend kills any running
+transaction that holds a valid XID, since a valid XID indicates that the
+transaction has already performed, or is planning to perform, WAL writes.
+Transactions that have not yet acquired an XID, and operations such as VACUUM
+or CREATE INDEX CONCURRENTLY that do not necessarily need an XID to write WAL,
+are not stopped during barrier processing, so they may later hit the error
+from XLogBeginInsert() when they try to write WAL in the read-only state.  To
+keep that error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that records
+whether permission was checked before calling XLogBeginInsert().  If it was
+not, XLogBeginInsert() fails an assertion.  The permission check is not
+mandatory when XLogBeginInsert() is used outside a critical section, where
+throwing an error is acceptable.  To set the flag, call CheckWALPermitted(),
+AssertWALPermittedHaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION().  The flag is reset automatically when the critical
+section is exited.  The rules for choosing one of the permission check
+routines are:
+
+	Places where a WAL write inside the critical section can be expected
+	without a valid XID (e.g. vacuum) must be protected by CheckWALPermitted(),
+	so that the error can be reported before the critical section is entered.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can be checked with AssertWALPermittedHaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, whether or
+	not an XID is held, but where assert-enabled builds should still verify
+	that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints while in that state.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
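
As a sketch of the coding rule the transam/README change above describes
(everything here is illustrative and not part of the patch; the function name
is invented), a VACUUM-like code path that may not hold an XID would check WAL
permission before entering the critical section, in the same way the patch
that follows does for the existing callers:

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
example_log_whole_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Check permission before the critical section; inside it, the ERROR
	 * raised by XLogBeginInsert() would be promoted to PANIC.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... modify the shared buffer here ... */
	MarkBufferDirty(buf);

	/* WAL-log the whole page, as heap_surgery.c does in the patch below. */
	if (needwal)
		log_newpage_buffer(buf, true);

	END_CRIT_SECTION();
}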

v22-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 692e947a1a157aa747b2ca58ba28b7de47d117d6 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v22 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Added the Assert or the ERROR when the system is WAL prohibited, based on the
following criteria:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, the common static inline
   function CheckWALPermitted() is added.
 - Added an Assert for functions that cannot be reached without a valid XID;
   the assertion also verifies XID validity.  For that,
   AssertWALPermittedHaveXID() is added.

To enforce the rule that one of these assert or error checks runs before
entering a critical section for a WAL write, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
fails an assertion when called inside a critical section.

If the WAL insert is not inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() itself to check and report an
error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  6 ++--
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 41 files changed, 463 insertions(+), 73 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e4..17e719e2dfc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -760,6 +761,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7cb87f4a3b3..49a5ad9de9b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2104,6 +2105,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2427,6 +2430,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2987,6 +2992,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3738,6 +3745,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3911,6 +3920,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4843,6 +4854,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5633,6 +5646,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5791,6 +5806,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5899,6 +5916,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combocid, either.  No need to extract replica identity, or
@@ -6019,6 +6038,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6049,6 +6069,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6059,7 +6083,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8341879d89b..d6f4381401f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -777,6 +778,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1219,6 +1221,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1234,7 +1239,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1505,6 +1510,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1522,7 +1530,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1952,6 +1960,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1959,6 +1968,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -1984,7 +1996,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * The startup process never runs in the WAL prohibit state during
+	 * recovery, so skip the permission check when we get here from there.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0bc86943ebc..9ee96e2c413 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 530d924bff8..d8856d13de7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 7bd269fd2a0..1f9dd667262 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1109,6 +1114,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1517,6 +1524,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1603,6 +1612,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1788,6 +1799,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b64a24..f17c7bec764 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 99359a98cba..5c08f7f02ed 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -25,6 +25,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag used to enforce the rule that WAL insert permission must
+ * be checked before starting a critical section that writes WAL.  For this,
+ * one of CheckWALPermitted(), AssertWALPermittedHaveXID(), or
+ * AssertWALPermitted() must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b03c30413f9..c54f67da3c7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1343,6 +1344,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1703,6 +1706,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 66b6c3973f1..9ce49a31815 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not block WAL flushing; otherwise dirty buffers
+	 * could never be evicted, since WAL must first be flushed up to their LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9051,6 +9053,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9302,6 +9306,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9459,6 +9466,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10111,7 +10120,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10125,10 +10134,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10150,8 +10159,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* We are assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 33f55ca8bba..56636b51df9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c93..90b32afc04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3826,13 +3826,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 09b54710b60..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -70,9 +70,9 @@ AssertWALPermitted(void)
 }
 
 /*
- * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
- * part of the code that can only be reached with an XID assigned is never
- * reached when WAL is prohibited.
+ * XID-bearing transactions are killed off when the pg_prohibit_wal() function
+ * is executed, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
  */
 static inline void
 AssertWALPermittedHaveXID(void)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 013850ac288..925c1a7a5bd 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -95,12 +95,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -110,6 +135,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -136,6 +162,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

#104Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#103)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is the rebased version for the latest master head (commit 9f6f1f9b8e6).

Regards,
Amul

On Mon, Mar 22, 2021 at 12:13 PM Amul Sul <sulamul@gmail.com> wrote:

On Fri, Mar 19, 2021 at 7:17 PM Prabhat Sahu
<prabhat.sahu@enterprisedb.com> wrote:

Hi all,
While testing this feature with the v20 patch, the server crashes with the steps below.

Steps to reproduce:
1. Configure master-slave replication setup.
2. Connect to Slave.
3. Execute the below statements; they will crash the server:
SELECT pg_prohibit_wal(true);
SELECT pg_prohibit_wal(false);

-- Slave:
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)

postgres=# SELECT pg_prohibit_wal(true);
pg_prohibit_wal
-----------------

(1 row)

postgres=# SELECT pg_prohibit_wal(false);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

Thanks, Prabhat.

The assertion failure is due to a wrong assumption about the flag used for the
XLogAcceptWrites() call. In the case of a standby, the startup process never
reaches the place where it could call XLogAcceptWrites() and update the
respective flag. Because of that stale flag value, the pg_prohibit_wal()
function alters the system state while in recovery, which is incorrect.

In the attached version I changed that flag to an enum value so that
pg_prohibit_wal() is allowed in recovery mode only if that flag indicates
that XLogAcceptWrites() has been skipped previously.
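
To illustrate the idea only (the enum and function names below are made up for
the example and need not match the attached patch):

#include <stdbool.h>

/*
 * Hypothetical sketch: track whether XLogAcceptWrites() has run, was skipped,
 * or is still pending, instead of keeping a plain boolean.
 */
typedef enum
{
	ACCEPT_WRITES_PENDING,		/* startup process has not decided yet */
	ACCEPT_WRITES_SKIPPED,		/* skipped because WAL writes were prohibited */
	ACCEPT_WRITES_DONE			/* XLogAcceptWrites() has been called */
} XLogAcceptWritesState;

/*
 * pg_prohibit_wal() would be allowed to change the WAL prohibit state during
 * recovery only when the earlier XLogAcceptWrites() call is known to have
 * been skipped; otherwise the request is rejected.
 */
static bool
wal_prohibit_change_allowed(bool in_recovery, XLogAcceptWritesState state)
{
	return !in_recovery || state == ACCEPT_WRITES_SKIPPED;
}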

Regards,
Amul

Attachments:

v23-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From addd2e9677289236f2248988a58327257af5c4cb Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v23 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Based on the following criteria, an ERROR or an Assert is added before the
critical section when the system is in the WAL prohibit state:

 - Added an ERROR for functions that can be reached without a valid XID, as in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common
   static inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also ensures XID validity.  For that, added
   AssertWALPermittedHaveXID().

To enforce the rule that one of the aforesaid Assert or ERROR checks is made
before entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added.  If the check is missing, XLogBeginInsert()
will fail an assertion when it is called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
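
For illustration, the intended calling pattern looks roughly like the sketch
below.  Only CheckWALPermitted(), AssertWALPermittedHaveXID(), and the
surrounding core calls are real; the function name, rmgr id, and record type
are placeholders:

/*
 * Sketch of the coding rule, assuming the usual includes
 * (access/walprohibit.h, access/xloginsert.h, storage/bufmgr.h, utils/rel.h).
 * RM_EXAMPLE_ID and XLOG_EXAMPLE_OP stand in for a real rmgr and record type.
 */
static void
example_log_page_change(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * The permission check must come before the critical section, because
	 * inside one a WAL-prohibited ERROR would be promoted to PANIC.  Paths
	 * that always hold an XID would use AssertWALPermittedHaveXID() instead.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... modify the page here ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		recptr = XLogInsert(RM_EXAMPLE_ID, XLOG_EXAMPLE_OP);
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();
}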
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  6 ++--
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 41 files changed, 463 insertions(+), 73 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2d8759c6c1a..4dc278b7d26 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 595310ba1b2..210eba8fc0a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2114,6 +2115,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2437,6 +2440,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2997,6 +3002,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3748,6 +3755,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3921,6 +3930,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4853,6 +4864,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5643,6 +5656,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5801,6 +5816,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5909,6 +5926,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6029,6 +6048,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6059,6 +6079,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6069,7 +6093,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702d..c9993ab1210 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -817,6 +818,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1254,6 +1256,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1269,7 +1274,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1540,6 +1545,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1557,7 +1565,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1986,6 +1994,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1993,6 +2002,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2018,7 +2030,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never runs in the WAL prohibit state,
+	 * so skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0bc86943ebc..9ee96e2c413 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ef48679cc2e..41d044ec3cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index e14c71abd07..70fcae42bda 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1105,6 +1110,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1513,6 +1520,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1599,6 +1608,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1784,6 +1795,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b64a24..f17c7bec764 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index cf91ba32d01..65c6826d7b8 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assertion flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87f101d72e2..cfad91661a4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 66b6c3973f1..9ce49a31815 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state should not restrict WAL flushing; otherwise, a dirty
+	 * buffer could not be evicted until WAL had been flushed up to its LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9051,6 +9053,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9302,6 +9306,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9459,6 +9466,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10111,7 +10120,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10125,10 +10134,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10150,8 +10159,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assert that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would force a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while the
+		 * system is in the WAL prohibit state.
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76affb7b549..ec05110b571 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c93..90b32afc04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3826,13 +3826,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL writes are disallowed in general,
+			 * don't dirty the page.  We can set the hint, but must not dirty
+			 * the page as a result, lest we trigger WAL generation.  Unless
+			 * the page is dirtied again later, the hint will be lost when the
+			 * page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 09b54710b60..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -70,9 +70,9 @@ AssertWALPermitted(void)
 }
 
 /*
- * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
- * part of the code that can only be reached with an XID assigned is never
- * reached when WAL is prohibited.
+ * XID-bearing transactions are killed off by executing the pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
  */
 static inline void
 AssertWALPermittedHaveXID(void)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 6f8251e0b07..18d9fa81458 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -96,12 +96,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -111,6 +136,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -137,6 +163,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v23-0003-Documentation.patch (application/x-patch)
From 575745dca7d8ad9e2302871520268e780d64fdb5 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v23 3/3] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3cf243a16ad..837584adf48 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24876,9 +24876,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24964,6 +24964,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument to alter the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept the state change immediately.  When
+        <literal>true</literal> is passed, the system is changed to the
+        read-only (WAL prohibited) state, if it is not already.  When
+        <literal>false</literal> is passed, the system is changed to the
+        read-write (WAL permitted) state, if it is not already.  See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..b54767beeb0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any user with permission to
+    do so can call the <function>pg_prohibit_wal</function> function to force
+    the system into a read-only mode in which inserting write-ahead log is
+    prohibited until the same function is executed again to change the state
+    back to read-write.  As in Hot Standby, connections to the server are
+    allowed to run read-only queries while in the WAL prohibited state.  While
+    the system only allows read-only queries, the GUC
+    <literal>wal_prohibited</literal> reports <literal>on</literal>;
+    otherwise, it reports <literal>off</literal>.  When the WAL prohibited
+    state is requested, any session that is running a transaction which has
+    already performed, or is planning to perform, WAL writes is terminated.
+    This is useful for an HA setup where the master server needs to stop
+    accepting WAL writes immediately and kick out any transaction expecting to
+    write WAL at the end, in case the network goes down on the master or
+    replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; at the
+    next restart, the server goes into crash recovery mode and stays in that
+    state until the system is changed back to read-write.  If a read-only
+    server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system
+    implicitly gets out of the read-only state.
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because the system was forced into the WAL prohibited state by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere to stop us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because, as mentioned previously, that would cause a PANIC.
+
+We do not reach the point where we try to write WAL during recovery but
+pg_prohibit_wal() can be executed anytime by the user to stop WAL writing.  Any
+backends which receive read-only system state transition barrier interrupt need
+to stop WAL writing immediately.  For barrier absorption the backed(s) will kill
+the running transaction which has valid XID indicates that the transaction has
+performed and/or planning WAL write.  The transaction which doesn't acquire
+valid XID yet or operation such VACUUM or CONCURRENT CREATE INDEX which not
+necessary have valid XID for WAL will not be prevented while barrier processing,
+and those might hit the error from XLogBeginInsert() while trying to write WAL
+in read only system state.  To prevent such error from XLogBeginInsert() inside
+the critical section the WAL write permission has to check before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that will write WAL, we have added an assertion flag that records
+whether permission was checked before XLogBeginInsert() is called.  If it was
+not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  The flag is set by calling either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION(), and it is automatically reset on exit from the
+critical section.  The rules for choosing among these routines are as follows
+(a sketch of the resulting pattern appears below):
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. vacuum) must be protected by CheckWALPermitted(), so
+	that the error can be reported before the critical section is entered.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked with AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and which
+	may or may not have an XID, should use AssertWALPermitted() so that the
+	permission check is still verified in assert-enabled builds.
+
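+As an illustrative sketch only (the enclosing function and the registered
+payload are hypothetical; what matters is the ordering of the calls around the
+critical section):
+
+	void
+	log_something_without_xid(char *payload, int len)
+	{
+		/* May raise an ERROR here; no unlogged changes exist yet. */
+		CheckWALPermitted();
+
+		START_CRIT_SECTION();
+		XLogBeginInsert();		/* asserts that permission was checked */
+		XLogRegisterData(payload, len);
+		/* ... XLogInsert() with the appropriate rmgr id and info ... */
+		END_CRIT_SECTION();
+	}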
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying blocks
+because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v23-0001-Implement-wal-prohibit-state-using-global-barrie.patchapplication/x-patch; name=v23-0001-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From 21e73f485f1b5b237c1110b42277b92c41bbc2a0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v23 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by
    calling the pg_prohibit_wal(true) SQL function, the state in shared
    memory is marked as in progress and the checkpointer process is
    signaled.  The checkpointer, noticing the state transition request,
    emits the barrier and then acknowledges back to the backend that
    requested the state change once the transition has completed.  The
    final state is also updated in the control file to make it persistent
    across system restarts.

 2. When a backend receives the WAL-Prohibited barrier while it is already
    in a transaction that has an assigned XID, the backend is killed by
    throwing FATAL (XXX: need more discussion on this).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up.  E.g. a
    backend might later request putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery (XXX:
    need some discussion on this as well), but the end-of-recovery
    checkpoint, the necessary WAL writes, and the control file update needed
    to start the server normally are skipped; they are performed when the
    system is changed to WAL-Permitted mode.  Until then the "Database
    cluster state" will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is restricted on a standby server,
    except in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly pulls the server out of
    the read-only (wal prohibited) state permanently.

 9. Add a wal_prohibited GUC that shows the system state -- it reports true
    when the system is wal prohibited or in recovery.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 439 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 294 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 ++
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 905 insertions(+), 132 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..cf91ba32d01
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,439 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which need further thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply kill
+		 * off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  When booting a read-only (wal prohibited)
+	 * server, the startup process skips the end-of-recovery checkpoint and
+	 * the related WAL writes, which must be completed before changing the
+	 * system state to read-write.  To disallow any other backend from writing
+	 * a WAL record before that end-of-crash-recovery checkpoint finishes, we
+	 * leave the server in recovery mode.
+	 */
+	if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_SKIPPED)
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try after sometime again.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the new WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the state transition by setting the requested state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush() remains permitted even in the WAL prohibited state.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared WAL
+	 * prohibit state counter should not have changed since we read it.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process, which has to make sure it
+	 * processes all pending WAL prohibit state change requests as soon as
+	 * possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes run
+	 * in non-checkpointer processes, do nothing if we are not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the wal prohibited state, the
+				 * WAL writes that the startup process normally performs to
+				 * start the server were skipped; if so, do them right away.
+				 * While doing that, hold off state transitions to avoid a
+				 * recursive call to process the wal prohibit state transition
+				 * from the end-of-recovery checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c83aa16f2ce..87f101d72e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because the pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e1490..66b6c3973f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally
+	 * have been performed.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5224,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6247,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the current value of SharedXLogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6462,8 +6481,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6612,13 +6631,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7854,16 +7882,134 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will
+	 * be written later in XLogAcceptWrites.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in recovery: shutting down in the wal prohibit
+		 * state skips the shutdown checkpoint, which forces crash recovery on
+		 * restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary
+ * before starting the server normally.  These operations are skipped during
+ * startup if the system was started in the wal prohibited state; in that case
+ * the checkpointer performs them while changing the system to the wal
+ * permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only the startup process, checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the wal prohibited state, these operations cannot
+	 * be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7881,15 +8027,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7920,6 +8071,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7999,63 +8152,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8084,27 +8187,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8319,9 +8427,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8340,9 +8448,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8364,6 +8483,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8653,9 +8778,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The shutdown checkpoint and xlog rotation are performed only if WAL
+	 * writing is permitted; during recovery a restartpoint is created instead.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8668,6 +8797,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d316d..92ed9cd5172 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1608,6 +1608,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there is
+		 * a worker slot available.  Third, we need to make sure that no other
+		 * worker failed
+		 * while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a4594..76affb7b549 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up WAL prohibit change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 16c6f17e235..63023de06e3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index accc1eb5776..b9b76e60024 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -723,6 +723,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c9c9da85f39..a46a4d59b40 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -233,6 +233,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -640,6 +641,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2109,6 +2111,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12455,4 +12469,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..09b54710b60
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
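+/*
+ * Worked example of the counter encoding (illustrative values only):
+ *
+ *   counter 0 -> READ_WRITE          counter 1 -> GOING_READ_ONLY
+ *   counter 2 -> READ_ONLY           counter 3 -> GOING_READ_WRITE
+ *   counter 4 -> READ_WRITE again, since 4 & 3 == 0; the counter keeps
+ *   increasing while the state cycles through its last two bits.
+ */
+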
+/* The calling code is never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the wal prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertions, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..d91427e0905 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of the work that enables WAL writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 69ffd0c3f4d..f9a79909ba4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11519,6 +11519,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 44448b48ec0..6ac26c9f26c 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -223,7 +223,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6a98064b2bd..13908e1f655 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2706,6 +2706,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

#105Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amul Sul (#104)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head (commit # 9f6f1f9b8e6).

Some minor comments on 0001:
Isn't it "might not be running"?
+ errdetail("Checkpointer might not running."),

Isn't it "Try again after sometime"?
+ errhint("Try after sometime again.")));

Can we use ereport(DEBUG1, ...) for the new log messages introduced in
the patch, just to be consistent (although it doesn't make any
difference from elog(DEBUG1))?
+ elog(DEBUG1, "waiting for backends to adopt requested WAL
prohibit state change");

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#106Amul Sul
sulamul@gmail.com
In reply to: Bharath Rupireddy (#105)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Thanks Bharath for your review.

On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head (commit # 9f6f1f9b8e6).

Some minor comments on 0001:
Isn't it "might not be running"?
+ errdetail("Checkpointer might not running."),

Ok, fixed in the attached version.

Isn't it "Try again after sometime"?
+ errhint("Try after sometime again.")));

Ok, done.

Can we use ereport(DEBUG1, ...) for the new log messages introduced in
the patch, just to be consistent (although it doesn't make any
difference from elog(DEBUG1))?
+ elog(DEBUG1, "waiting for backends to adopt requested WAL
prohibit state change");

I think it's fine; many existing places have used elog(DEBUG1, ....) too.
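For reference, the ereport() form discussed above would look roughly like
the following sketch; it is not part of the patch, and errmsg_internal()
is used here only so the debug string is not marked for translation:

    ereport(DEBUG1,
            (errmsg_internal("waiting for backends to adopt requested WAL prohibit state change")));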

Regards,
Amul

Attachments:

v24-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 5b8bed83eecaff6703542637936ce7c31efd4138 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v24 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before WAL writes when the system is WAL-prohibited,
based on the following criteria:

 - Add an ERROR for functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, add the common static
   inline function CheckWALPermitted().
 - Add an Assert for functions that cannot be reached without a valid XID; the
   Assert also verifies XID validity.  For that, add
   AssertWALPermittedHaveXID().

To enforce the rule that one of these checks precedes any critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when called
inside a critical section.

If the WAL insert is not done inside a critical section, the check above is
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
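
As an illustration only (this sketch is not part of the diff below), a typical
call site following the rule looks roughly like this, where rel and buf stand
for whichever relation and buffer are being modified:

    bool        needwal = RelationNeedsWAL(rel);

    /* The permission check must happen before entering the critical section. */
    if (needwal)
        CheckWALPermitted();

    START_CRIT_SECTION();

    /* ... scribble on the page ... */
    MarkBufferDirty(buf);

    /* XLOG stuff */
    if (needwal)
        log_newpage_buffer(buf, true);

    END_CRIT_SECTION();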
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 18 ++++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  6 ++--
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 41 files changed, 463 insertions(+), 73 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2d8759c6c1a..4dc278b7d26 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 595310ba1b2..210eba8fc0a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2114,6 +2115,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2437,6 +2440,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2997,6 +3002,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3748,6 +3755,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3921,6 +3930,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4853,6 +4864,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5643,6 +5656,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5801,6 +5816,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5909,6 +5926,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6029,6 +6048,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6059,6 +6079,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6069,7 +6093,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406e..92ea665f989 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -233,6 +234,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -287,6 +289,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -320,7 +326,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702d..c9993ab1210 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -817,6 +818,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
+	bool		needwal = RelationNeedsWAL(onerel);
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
@@ -1254,6 +1256,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1269,7 +1274,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (needwal &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
@@ -1540,6 +1545,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buf);
@@ -1557,7 +1565,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -1986,6 +1994,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(onerel);
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
@@ -1993,6 +2002,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 blkno, InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2018,7 +2030,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never has a WAL prohibit state, so skip
+	 * the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0bc86943ebc..9ee96e2c413 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ef48679cc2e..41d044ec3cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index e14c71abd07..70fcae42bda 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1105,6 +1110,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1513,6 +1520,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1599,6 +1608,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1784,6 +1795,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a9ffca5183b..e9aa3e1e01c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b64a24..f17c7bec764 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index b40c05f1a26..1f389574436 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87f101d72e2..cfad91661a4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 66b6c3973f1..9ce49a31815 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise, dirty buffers
+	 * could not be evicted until WAL has been flushed up to their LSNs.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9051,6 +9053,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only ");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9302,6 +9306,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9459,6 +9466,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10111,7 +10120,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10125,10 +10134,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10150,8 +10159,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assert that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76affb7b549..ec05110b571 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c93..90b32afc04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3826,13 +3826,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 09b54710b60..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -70,9 +70,9 @@ AssertWALPermitted(void)
 }
 
 /*
- * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
- * part of the code that can only be reached with an XID assigned is never
- * reached when WAL is prohibited.
+ * XID-bearing transactions are killed off by executing the pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
  */
 static inline void
 AssertWALPermittedHaveXID(void)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 6f8251e0b07..18d9fa81458 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -96,12 +96,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -111,6 +136,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -137,6 +163,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

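For reference, the call sites touched by the hunks above all follow the same
shape: check WAL permission (or assert it, when an assigned XID already implies
the check happened) before entering the critical section that will write WAL.
The following is only a minimal sketch of that pattern, with a hypothetical
function name and placeholder page-change details, so it is illustrative rather
than compilable on its own:

static void
do_some_wal_logged_change(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Outside the critical section: error out cleanly if WAL is prohibited.
	 * Paths that already hold an assigned XID would instead use
	 * AssertWALPermittedHaveXID(), since acquiring the XID implies the check.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... apply the page change here ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		/* XLogBeginInsert() re-checks and asserts that the rule was followed */
		XLogBeginInsert();
		/* ... XLogRegisterBuffer()/XLogRegisterData() and XLogInsert() ... */
	}

	END_CRIT_SECTION();
}

As in the spgvacuum.c hunks above, the needwal flag is computed once, outside
the critical section, and reused for the XLogInsert() call.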
Attachment: v24-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From f062b2ec03d952a51edcce0f284217de8e3eb5d6 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v24 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user requests a change of the server state to WAL-Prohibited by
    calling the pg_prohibit_wal(true) SQL function, the shared-memory state
    counter is advanced to the in-progress state and the checkpointer process
    is signaled.  The checkpointer, noticing the pending state transition,
    emits the barrier request and, once the transition has completed,
    acknowledges back to the backend that requested the state change.  The
    final state is also recorded in the control file to make it persistent
    across system restarts.  (See the sketch after this list for how the
    counter encodes the state.)

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and the transaction has already been assigned an XID, then the
    backend is killed by throwing FATAL (XXX: needs more discussion on this).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up, e.g. a
    backend might later request putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery (XXX:
    this needs some discussion as well), but the end-of-recovery checkpoint,
    the necessary WAL writes, and the control file update required to start
    the server normally are skipped; they are performed later, when the system
    is changed to WAL-Permitted mode.  Until then the "Database cluster state"
    will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is not allowed on a standby server,
    except in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly pulls the server out of the
    read-only (WAL prohibited) state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
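To make point 1 concrete, here is a rough, self-contained sketch of how the
shared wal_prohibit_counter encodes the state.  The real definitions live in
access/walprohibit.h and are not quoted here, so the exact enum values below
are an assumption; what pg_prohibit_wal() and CompleteWALProhibitChange() in
this patch do imply is that the low two bits of the counter carry the state,
the requesting backend bumps the counter once to reach a "going" state, and the
checkpointer bumps it once more to reach the final state:

/* Illustration only -- not part of the patch. */
#include <stdint.h>
#include <stdio.h>

/* Assumed ordering; the low two bits of the counter give the state. */
typedef enum
{
	STATE_READ_WRITE = 0,		/* final: WAL writes allowed */
	STATE_GOING_READ_ONLY = 1,	/* requested by pg_prohibit_wal(true) */
	STATE_READ_ONLY = 2,		/* final: WAL writes prohibited */
	STATE_GOING_READ_WRITE = 3	/* requested by pg_prohibit_wal(false) */
} State;

static const char *state_names[] = {
	"READ WRITE", "GOING READ ONLY", "READ ONLY", "GOING READ WRITE"
};

static State
state_of(uint32_t counter)
{
	return (State) (counter & 3);
}

int
main(void)
{
	uint32_t	counter = 0;	/* server starts read-write */

	printf("initial: %s\n", state_names[state_of(counter)]);

	counter++;					/* backend: pg_prohibit_wal(true) */
	printf("request: %s\n", state_names[state_of(counter)]);

	counter++;					/* checkpointer: barrier absorbed, finalize */
	printf("final:   %s\n", state_names[state_of(counter)]);

	counter += 2;				/* pg_prohibit_wal(false), then checkpointer */
	printf("back to: %s\n", state_names[state_of(counter)]);

	return 0;
}

This two-increment protocol is why pg_prohibit_wal() can compute
target_counter_value as wal_prohibit_counter + 1 and simply sleep on the
condition variable until the checkpointer has brought the counter there.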
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 439 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 294 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 ++
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 905 insertions(+), 132 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..b40c05f1a26
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,439 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply abort an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  The startup process skips the end-of-recovery
+	 * checkpoint and the related WAL writes when booting a read-only (WAL
+	 * prohibited) server; those must be completed before the system state can
+	 * be changed to read-write.  To keep any other backend from writing a WAL
+	 * record before that end-of-crash-recovery checkpoint finishes, we leave
+	 * the server in recovery mode.
+	 */
+	if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_SKIPPED)
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the new WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer; otherwise, we must be running as
+	 * a standalone (single-user) backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to flush XLOG data aggressively right away, since
+	 * XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared WAL
+	 * prohibit state counter cannot have been changed by anyone else in the
+	 * meantime.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * make sure it processes all pending WAL prohibit state change requests
+	 * as soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if we are not
+	 * the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server is started in wal prohibited state then the
+				 * required wal write operation in the startup process to
+				 * start the server normally has been skipped, if it is, then
+				 * does that right away.  While doing that, hold off state
+				 * transition to avoid a recursive call to process wal
+				 * prohibit state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let Checkpointer process do anything until
+					 * someone wakes it up.  For example a backend might later
+					 * on request us to put the system back to read-write
+					 * state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c83aa16f2ce..87f101d72e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e1490..66b6c3973f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState tracks whether the end-of-recovery checkpoint
+	 * and the WAL writes required to start the server normally were done.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5224,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6247,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the current value of SharedXLogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6462,8 +6481,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6612,13 +6631,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7854,16 +7882,134 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later, in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether any further WAL insert should be
+	 * allowed or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We always start in recovery here: shutting down in the WAL prohibit
+		 * state skips the shutdown checkpoint, which forces recovery on restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG(), performing the WAL writes necessary
+ * before the server can run normally.  These operations are skipped during
+ * startup if the system was started in the WAL prohibited state; they are then
+ * performed by the checkpointer when the system is changed to WAL permitted.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only startup, checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, these operations cannot
+	 * be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7881,15 +8027,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7920,6 +8071,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7999,63 +8152,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8084,27 +8187,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8319,9 +8427,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8340,9 +8448,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8364,6 +8483,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8653,9 +8778,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is created during recovery; the shutdown checkpoint and
+	 * xlog rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8668,6 +8797,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d316d..92ed9cd5172 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1608,6 +1608,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system is not read only i.e. wal writes
+		 * permitted.  Second, we need to make sure that there is a worker slot
+		 * available.  Third, we need to make sure that no other worker failed
+		 * while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a4594..76affb7b549 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e4..58e2c7fe339 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -100,7 +101,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -526,8 +526,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -593,24 +593,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold off WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for the
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for a pending WAL prohibit state change request (checkpointer only) */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold off WAL prohibit state change requests for a
+		 * long time when there are many fsync requests to be processed.  They
+		 * need to be checked and processed by the checkpointer promptly.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 16c6f17e235..63023de06e3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index accc1eb5776..b9b76e60024 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -723,6 +723,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c9c9da85f39..a46a4d59b40 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -233,6 +233,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -640,6 +641,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2109,6 +2111,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12455,4 +12469,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..09b54710b60
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertions above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering the
+ * critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
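
As an illustration of the counter encoding described in walprohibit.h above --
not part of the patch -- the following standalone sketch shows how successive
counter values map onto the four states:

#include <stdio.h>
#include <stdint.h>

/* Mirrors the WALProhibitState encoding: the counter only ever increments by
 * one, and its last two bits name the current state. */
static const char *
state_name(uint32_t counter)
{
	switch (counter & 3)
	{
		case 0:
			return "READ_WRITE";
		case 1:
			return "GOING_READ_ONLY";
		case 2:
			return "READ_ONLY";
		case 3:
			return "GOING_READ_WRITE";
	}
	return "unreachable";
}

int
main(void)
{
	/* A cluster that starts read-write (counter 0), is made read only, and
	 * is then made read-write again walks through counters 0..4. */
	for (uint32_t counter = 0; counter <= 4; counter++)
		printf("counter=%u -> %s\n", (unsigned) counter, state_name(counter));
	return 0;
}
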
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..d91427e0905 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of the work that enables WAL writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* True if WAL writes are prohibited, i.e. WAL insertion is not allowed. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 69ffd0c3f4d..f9a79909ba4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11519,6 +11519,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4543', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 4ae7dc33b8e..9e834247871 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -48,12 +48,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 44448b48ec0..6ac26c9f26c 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -223,7 +223,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6a98064b2bd..13908e1f655 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2706,6 +2706,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v24-0003-Documentation.patch (application/x-patch)
From 3f1dac2d00e87ae4d52bca910cf49791b310bdc7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v24 3/3] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3cf243a16ad..837584adf48 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24876,9 +24876,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24964,6 +24964,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to adopt that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to read-only
+        (the WAL prohibited state), if it is not in that state already. When
+        <literal>false</literal> is passed, the system is changed to read-write
+        (the WAL permitted state), if it is not in that state already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..b54767beeb0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which inserting write-ahead log is prohibited, until
+    the same function is executed to change the state back to read-write. As in
+    Hot Standby, connections to the server are still allowed to run read-only
+    queries while in the WAL prohibited state.  While the system only allows
+    read-only queries, the GUC <literal>wal_prohibited</literal> reports
+    <literal>on</literal>; otherwise it reports <literal>off</literal>.  When a
+    user requests the WAL prohibited state, any session that is already running
+    a transaction which has performed, or may still perform, WAL write
+    operations is terminated. This is useful for HA setups where the master
+    server needs to stop accepting WAL writes immediately and kick out any
+    transaction expecting to write WAL at the end, for example when the network
+    goes down on the master or replication connections fail.
+   </para>
+
+   <para>
+    Shutting down the read-only system skips the shutdown checkpoint, so at the
+    next restart the server goes into crash recovery mode and stays in that
+    state until the system is changed to read-write. If a read-only server
+    finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system implicitly
+    leaves the read-only state.
+   </para>
+ </sect1>
 </chapter>
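
As a concrete illustration of the behavior documented above -- not part of the
patch -- here is a minimal libpq sketch; it assumes a server on which
pg_prohibit_wal(true) has already been executed and a table named t, so the
INSERT should fail as a read-only-transaction error (SQLSTATE 25006):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* Connection string and table name are assumptions for this sketch. */
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* With the system in the WAL prohibited state, new transactions start
	 * read-only, so this write is expected to be rejected. */
	res = PQexec(conn, "INSERT INTO t DEFAULT VALUES");
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

		printf("write rejected, SQLSTATE %s: %s",
			   sqlstate ? sqlstate : "?", PQresultErrorMessage(res));
	}
	PQclear(res);
	PQfinish(conn);
	return 0;
}

Sessions that already hold an XID at the time of the state change are instead
terminated with FATAL, as described above.
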
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is read only when it is not currently possible to insert write-ahead
+log records, either because it is still in recovery or because it was forced
+into the WAL prohibited state by executing the pg_prohibit_wal() function.  We
+have a lower-level defense in XLogBeginInsert() and elsewhere that stops us from
+modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because an error there escalates to PANIC as mentioned
+previously.
+
+During recovery we never reach the point of trying to write WAL, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt must
+stop writing WAL immediately.  To absorb the barrier, a backend kills its
+running transaction if it has a valid XID, since a valid XID indicates that the
+transaction has already performed, or is planning to perform, WAL writes.  A
+transaction that has not yet acquired an XID, or an operation such as VACUUM or
+CREATE INDEX CONCURRENTLY that does not necessarily have a valid XID when it
+writes WAL, is not stopped during barrier processing; it might instead hit the
+error from XLogBeginInsert() when trying to write WAL in the read-only state.
+To prevent that error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assertion flag that records whether
+permission has been checked before calling XLogBeginInsert().  If it has not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among the permission check routines
+are (see the sketch after this list):
+
+	Places where a WAL write inside a critical section can be reached without
+	a valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so
+	that the error is reported before the critical section is entered.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can use AssertWALPermitted_HaveXID(), so that non-assert builds
+	do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, should use AssertWALPermitted() so that
+	assert-enabled builds still verify that permission has been checked.
+
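
As a sketch of the pattern these rules describe (illustration only, not part of
the patch; it assumes the usual xloginsert.h, walprohibit.h, and miscadmin.h
inclusions, and uses a placeholder record type):

void
example_log_something(char *payload, int len)	/* hypothetical routine */
{
	/* Permission is checked (or asserted) before the critical section, so a
	 * read-only system reports an ERROR here rather than PANICking below. */
	CheckWALPermitted();

	START_CRIT_SECTION();

	XLogBeginInsert();
	XLogRegisterData(payload, len);
	(void) XLogInsert(RM_XLOG_ID, XLOG_NOOP);	/* placeholder rmgr/record */

	END_CRIT_SECTION();
}
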
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those cases.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

#107Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#106)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

The patch set has gone stale again; attached is the rebased version.

Regards,
Amul


On Mon, Apr 5, 2021 at 5:27 PM Amul Sul <sulamul@gmail.com> wrote:

On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Thanks Bharath for your review.

On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head (commit # 9f6f1f9b8e6).

Some minor comments on 0001:
Isn't it "might not be running"?
+ errdetail("Checkpointer might not running."),

Ok, fixed in the attached version.

Isn't it "Try again after sometime"?
+ errhint("Try after sometime again.")));

Ok, done.

Can we use ereport(DEBUG1, ...) just to be consistent (although it doesn't
make any difference from elog(DEBUG1)) with the new log messages
introduced in the patch?
+ elog(DEBUG1, "waiting for backends to adopt requested WAL
prohibit state change");

I think it's fine; many existing places have used elog(DEBUG1, ....) too.

Regards,
Amul

Attachments:

v25-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/octet-stream)
From d0b30087a29f7cb10b1f3f5704793133b69fc183 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v25 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-prohibited by calling
    the pg_prohibit_wal(true) SQL function, the in-progress state transition
    is marked in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the in-progress state transition, makes the
    barrier request and then acknowledges back to the backend that
    requested the state change once the transition has been completed.
    The final state is updated in the control file to make it persistent
    across system restarts.

 2. When a backend receives the WAL-prohibited barrier, if it is already in
    a transaction and the transaction has already been assigned an XID,
    then the backend is killed by throwing FATAL (XXX: needs more
    discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special is needed right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-prohibited server state until someone wakes them up.  E.g. a
    backend might later request that the system be put back to read-write.

 6. At shutdown in WAL-prohibited mode, we skip the shutdown checkpoint
    and xlog rotation. Starting up again will perform crash recovery (XXX:
    needs some discussion as well), but the end-of-recovery checkpoint, the
    necessary WAL writes, and the control file update needed to start the
    server normally are skipped; they are performed when the system is
    changed to WAL-permitted mode. Until then the "Database cluster
    state" remains "in crash recovery".

 7. Altering the WAL-prohibited mode is not allowed on a standby server,
    except in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile implicitly pulls the server out of
    the read-only (WAL prohibited) state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be true
    when the system is WAL prohibited or in recovery.
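
In short, a read-only request proceeds as follows: the backend bumps the
shared counter from READ_WRITE (0) to GOING_READ_ONLY (1), signals the
checkpointer with SIGUSR1, and sleeps on a condition variable; the
checkpointer records the pending state in the control file, emits
PROCSIGNAL_BARRIER_WALPROHIBIT, waits until every backend has absorbed it
(XID-bearing transactions are killed during absorption), flushes WAL, bumps
the counter to READ_ONLY (2), and broadcasts the condition variable so the
waiting backend returns.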
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 439 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 294 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 ++
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 905 insertions(+), 132 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..b40c05f1a26
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,439 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL
+		 * prohibited state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we could
+		 * not simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  In the startup process, we skip the
+	 * end-of-recovery checkpoint and the related WAL writes while booting a
+	 * read-only (WAL prohibited) server; these must be completed before the
+	 * system state is changed to read-write.  To disallow any other backend
+	 * from writing a WAL record before the end-of-crash-recovery checkpoint
+	 * finishes, we leave the server in recovery mode.
+	 */
+	if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_SKIPPED)
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state yet, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed. If the server crashes before the state completion, then the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state anyway.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared WAL
+	 * prohibit state counter should not have changed by now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * The checkpointer completes a pending WAL prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Only the checkpointer process completes state change requests, and it
+	 * has to process any pending WAL prohibit state change request as soon as
+	 * possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes run
+	 * in non-checkpointer processes, do nothing if we are not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process would normally perform
+				 * to start the server have been skipped; if so, do them right
+				 * away.  While doing that, hold off state transitions to
+				 * avoid a recursive attempt to process a WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let Checkpointer process do anything until
+					 * someone wakes it up.  For example a backend might later
+					 * on request us to put the system back to read-write
+					 * state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c83aa16f2ce..87f101d72e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1d4415a433..e39082e5e0a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates the progress of the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5224,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6247,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the value of SharedXLogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6458,8 +6477,8 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
 	struct stat st;
+	bool		needChkpt = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6608,13 +6627,22 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7850,16 +7878,134 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in recovery: at shutdown in the WAL prohibited
+		 * state we skip the shutdown checkpoint, which forces recovery on restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG, performing the WAL writes necessary
+ * before the server can start running normally.  These operations are skipped
+ * at startup if the system is started in the WAL prohibited state; in that
+ * case the checkpointer performs them when the system is changed to the WAL
+ * permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, these operations cannot
+	 * be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7877,15 +8023,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7916,6 +8067,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7995,63 +8148,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8080,27 +8183,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8315,9 +8423,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8336,9 +8444,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8360,6 +8479,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8649,9 +8774,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is performed during recovery; otherwise, the shutdown
+	 * checkpoint and xlog rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8664,6 +8793,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d316d..92ed9cd5172 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1608,6 +1608,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13eb..0057dfa8c46 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there is
+		 * a worker slot available.  Third, we need to make sure that no other
+		 * worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a4594..76affb7b549 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for a WAL prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index eac68951414..77950e04600 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay WAL prohibit state change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for a WAL prohibit state change request for the checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to delay WAL prohibit state change requests for a long
+		 * time when there are many fsync requests to be processed.  They need to
+		 * be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request again here, for
+				 * the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 16c6f17e235..63023de06e3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index accc1eb5776..b9b76e60024 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -723,6 +723,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c9c9da85f39..a46a4d59b40 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -233,6 +233,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -640,6 +641,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2109,6 +2111,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12455,4 +12469,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..09b54710b60
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while the system state is changed to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
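
(Not part of the patch: a minimal, standalone sketch of how the shared counter
described in the header above maps onto the four states.  The enum is copied
from walprohibit.h; the main() driver, the loop, and the printf() output are
purely illustrative assumptions.)

    #include <stdio.h>
    #include <stdint.h>

    /* Copied from walprohibit.h above */
    typedef enum
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,       /* WAL permitted */
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
        WALPROHIBIT_STATE_READ_ONLY = 2,        /* WAL prohibited */
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3
    } WALProhibitState;

    int
    main(void)
    {
        uint32_t    counter = 0;    /* 0 or 2 at startup, per the control file */
        int         i;

        /* Each requested or completed transition adds 1 to the counter */
        for (i = 0; i < 6; i++)
        {
            printf("counter=%u  state=%d\n", (unsigned) counter, (int) (counter & 3));
            counter++;
        }
        return 0;
    }

The low two bits cycle READ_WRITE -> GOING_READ_ONLY -> READ_ONLY ->
GOING_READ_WRITE -> READ_WRITE, which is exactly the only legal transition
order, so no separate validation of the previous state is needed.
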
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..d91427e0905 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Progress of the operations that enable WAL writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited, i.e. the system is read only. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4309fa40dd2..313a90d7c1d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11543,6 +11543,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 44448b48ec0..6ac26c9f26c 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -223,7 +223,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b26e81dcbf7..9d3a581a3f3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2706,6 +2706,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v25-0003-Documentation.patch (application/octet-stream)
From 07234cc1665e5e309dcc21e758d4bc9832405bda Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v25 3/3] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c6a45d9e55c..912a6c47740 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24880,9 +24880,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -24988,6 +24988,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         however only superusers can terminate superuser backends.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state of the system and forces all
+        <productname>PostgreSQL</productname> server processes to accept
+        that state change immediately. When <literal>true</literal> is
+        passed, the system is changed to read-only (the WAL prohibited
+        state) if it is not already. When <literal>false</literal> is
+        passed, the system is changed to read-write (the WAL permitted
+        state) if it is not already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..b54767beeb0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which write-ahead log inserts are prohibited until the
+    same function is executed again to change the state back to read-write.
+    As in Hot Standby, connections to the server are still allowed and can run
+    read-only queries while in the WAL prohibited state.  If the system only
+    allows read-only queries, the GUC <literal>wal_prohibited</literal> reports
+    <literal>on</literal>; otherwise it reports <literal>off</literal>.  When a
+    user requests the WAL prohibited state, any existing session that is
+    running a transaction which has already performed, or is planning to
+    perform, WAL write operations is terminated. This is useful for HA setups
+    where the master server needs to stop accepting WAL writes immediately
+    and kick out any transaction that would expect to write WAL at the end,
+    for example when the network on the master goes down or replication
+    connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint, so at the
+    next start the server goes through crash recovery and stays read-only
+    until the system is changed back to read-write. If a starting read-only
+    server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file, the system implicitly leaves
+    the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The read-only system state is one in which write-ahead log records cannot
+currently be inserted, either because the system is still in recovery or
+because it was forced into the WAL prohibited state with pg_prohibit_wal().
+We have a lower-level defense in XLogBeginInsert() and elsewhere to stop us
+from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error, because that would cause a PANIC as mentioned previously.
+
+We never reach the point of writing WAL during recovery, but pg_prohibit_wal()
+can be executed at any time by a user to stop WAL writing.  Any backend that
+receives the read-only system state transition barrier interrupt must stop
+writing WAL immediately.  While absorbing the barrier, a backend kills its
+running transaction if it has a valid XID, since that indicates the
+transaction has already performed, or is planning to perform, a WAL write.
+Transactions that have not acquired an XID yet, and operations such as VACUUM
+or CREATE INDEX CONCURRENTLY that do not necessarily hold an XID when they
+write WAL, are not stopped during barrier processing; they may instead hit an
+error from XLogBeginInsert() when they try to write WAL in the read-only
+state.  To prevent that error from being raised inside a critical section,
+WAL write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that will write WAL, an assert-only flag records that the permission
+check has been done before XLogBeginInsert() is called.  If it has not been,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  The flag is set by calling CheckWALPermitted(),
+AssertWALPermittedHaveXID(), or AssertWALPermitted() before
+START_CRIT_SECTION(), and it is reset automatically when the critical section
+is exited.  The rules for choosing among these permission check routines are
+as follows:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can be checked with AssertWALPermittedHaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, should use AssertWALPermitted() to ensure the
+	permission has been checked in assert-enabled builds.
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints while the system is read only.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
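
(Not part of the patch: a minimal sketch of the calling pattern the
transam/README addition above prescribes for a code path that may write WAL
without holding an XID, such as VACUUM.  "rel" and the elided buffer changes
are placeholders; the functions named are the ones added by, or already used
with, this patch set.)

    if (RelationNeedsWAL(rel))
        CheckWALPermitted();    /* may ERROR here, safely outside the critical section */

    START_CRIT_SECTION();

    /* ... apply the changes to the shared buffer(s), MarkBufferDirty() ... */

    if (RelationNeedsWAL(rel))
    {
        XLogBeginInsert();      /* asserts that the permission check was done */
        /* XLogRegisterBuffer()/XLogRegisterData() and XLogInsert() go here */
    }

    END_CRIT_SECTION();

This mirrors what the 0002 patch does in places such as brin_doinsert() and
gistprunepage().
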
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v25-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/octet-stream)
From 645351e376bd36e632747a5cea1dea2410f12bcc Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v25 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before START_CRIT_SECTION for WAL writes, based on
the following criteria, for when the system is in the WAL prohibited state:

 - Added an ERROR for functions that can be reached without a valid XID, as
   in VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common static
   inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that, added AssertWALPermittedHaveXID().

To enforce the rule that one of these assert or error checks is made before
entering a critical section that writes WAL, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
fails an assertion when it is called inside a critical section.

If the WAL insert is not inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() itself to perform the check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 22 ++++++++++++---
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  6 ++--
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 41 files changed, 466 insertions(+), 74 deletions(-)
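
(Not part of the patch: a companion sketch to the commit message above showing
the assert-only variant used on paths that are only reachable with an assigned
XID, such as the INSERT/UPDATE code paths.  The elided buffer work is a
placeholder; AssertWALPermittedHaveXID() is the helper the patch adds.)

    /* Only reachable from DML, so a top-level XID has already been assigned */
    AssertWALPermittedHaveXID();    /* compiles away in non-assert builds */

    START_CRIT_SECTION();

    /* ... modify the page, MarkBufferDirty(), XLogBeginInsert()/XLogInsert() ... */

    END_CRIT_SECTION();

Because the WAL prohibit barrier kills every transaction that holds an XID,
these paths cannot be reached in the read-only state, so an ERROR check there
would be redundant overhead.
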

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2d8759c6c1a..4dc278b7d26 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf429..6c20bc53fc5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9cbc161d7a9..04df67c510d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2114,6 +2115,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2437,6 +2440,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2997,6 +3002,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3748,6 +3755,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3921,6 +3930,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4853,6 +4864,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5643,6 +5656,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5801,6 +5816,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5909,6 +5926,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6029,6 +6048,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6059,6 +6079,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6069,7 +6093,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f75502ca2c0..33c07961d2f 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 446e3bc4523..af35c3cd47f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1242,6 +1243,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1257,8 +1263,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1837,8 +1842,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1863,7 +1873,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2144,6 +2154,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2154,6 +2165,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2183,7 +2197,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never runs in WAL prohibit state, so
+	 * skip the permission check when we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0bc86943ebc..9ee96e2c413 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ef48679cc2e..41d044ec3cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f06..df9506b7cd5 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1130,6 +1135,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1538,6 +1545,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1624,6 +1633,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1809,6 +1820,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b64a24..f17c7bec764 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index b40c05f1a26..1f389574436 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission must be
+ * checked before starting a critical section that writes WAL.  For this, one
+ * of CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87f101d72e2..cfad91661a4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e39082e5e0a..6bf4afb1cfa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9047,6 +9049,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9298,6 +9302,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9455,6 +9462,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10107,7 +10116,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10121,10 +10130,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10146,8 +10155,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise a WAL-prohibited error there would escalate to PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76affb7b549..ec05110b571 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c93..90b32afc04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3826,13 +3826,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 09b54710b60..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -70,9 +70,9 @@ AssertWALPermitted(void)
 }
 
 /*
- * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
- * part of the code that can only be reached with an XID assigned is never
- * reached when WAL is prohibited.
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
  */
 static inline void
 AssertWALPermittedHaveXID(void)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af5..f18232cbf53 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -97,12 +97,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when we are no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -112,6 +137,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -138,6 +164,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

#108Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#107)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Rebased again.


On Wed, Apr 7, 2021 at 12:38 PM Amul Sul <sulamul@gmail.com> wrote:

Rotten again, attached the rebased version.

Regards,
Amul

On Mon, Apr 5, 2021 at 5:27 PM Amul Sul <sulamul@gmail.com> wrote:

On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Thanks Bharath for your review.

On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head (commit # 9f6f1f9b8e6).

Some minor comments on 0001:
Isn't it "might not be running"?
+ errdetail("Checkpointer might not running."),

Ok, fixed in the attached version.

Isn't it "Try again after sometime"?
+ errhint("Try after sometime again.")));

Ok, done.

Can we have ereport(DEBUG1 just to be consistent (although it doesn't
make any difference from elog(DEBUG1)) with the new log messages
introduced in the patch?
+ elog(DEBUG1, "waiting for backends to adopt requested WAL
prohibit state change");

I think it's fine; many existing places have used elog(DEBUG1, ....) too.

Regards,
Amul

Attachments:

v26-0001-Implement-wal-prohibit-state-using-global-barrie.patch (application/octet-stream)
From c0636fe657f0239526e69abc1e980f64e421f82b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v26 1/3] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user requests changing the server state to WAL-Prohibited by
    calling the pg_prohibit_wal(true) SQL function (see the usage sketch
    after this list), the current state generation is marked as in progress
    in shared memory and the checkpointer process is signaled.  The
    checkpointer, noticing the state transition, makes the barrier request
    and then acknowledges back to the backend that requested the state
    change once the transition has completed.  The final state is also
    updated in the control file to make it persistent across restarts.

 2. When a backend receives the WAL-Prohibited barrier while it is already in
    a transaction that has been assigned an XID, the backend will be killed by
    throwing FATAL.  (XXX: needs more discussion.)

 3. Otherwise, if the backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the read-only state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    while the server is in the WAL-Prohibited state until woken up, e.g. by a
    backend that later requests putting the system back to read-write.

 6. At shutdown in WAL-Prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery
    (XXX: needs some discussion as well), but the end-of-recovery checkpoint,
    the necessary WAL writes, and the control file update needed to start the
    server normally will be skipped; they are performed once the system is
    changed to WAL-Permitted mode.  Until then the "Database cluster state"
    will be "in crash recovery".

 7. Altering the WAL-Prohibited mode is restricted on a standby server, except
    in the "in crash recovery" state described in the previous point.

 8. The presence of RecoverySignalFile will implicitly and permanently pull
    the server out of the read-only (WAL prohibited) state.

 9. Add a wal_prohibited GUC to show the system state -- it will be true when
    the system is WAL prohibited or in recovery.
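
A minimal usage sketch of the interface described above (the function and GUC
are the ones this patch adds; the exact session flow is illustrative only):

    SELECT pg_prohibit_wal(true);    -- request the WAL-Prohibited (read-only) state
    SHOW wal_prohibited;             -- true while WAL is prohibited (also true in recovery)
    -- per point 4 above, new transactions now start as read-only transactions
    SELECT pg_prohibit_wal(false);   -- request the WAL-Permitted (read-write) state
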
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 443 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 293 ++++++++++-----
 src/backend/catalog/system_views.sql     |   2 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  26 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         | 104 ++++++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 23 files changed, 908 insertions(+), 132 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 39f9d4e77d4..c5259157e75 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..7128f2da8e7
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,443 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning to the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply
+		 * kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then an ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("system is now read only"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit;
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.  When booting a read-only (WAL prohibited)
+	 * server, the startup process skips the end-of-recovery checkpoint and
+	 * the related WAL write operations, which must be completed before the
+	 * system state can be changed to read-write.  To prevent any other
+	 * backend from writing a WAL record before the end-of-crash-recovery
+	 * checkpoint finishes, we leave the server in recovery mode.
+	 */
+	if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_SKIPPED)
+		PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	if (PG_ARGISNULL(0))
+		PG_RETURN_VOID();
+
+	walprohibit = PG_GETARG_BOOL(0);
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read write is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to read only is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
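+	/*
+	 * If we were in a stable state (READ_WRITE or READ_ONLY), the increment
+	 * just below moves the shared counter into the corresponding transitional
+	 * state (GOING_READ_ONLY or GOING_READ_WRITE); if a transition was
+	 * already in progress, no increment is needed here.  Either way, the
+	 * checkpointer's final increment will take the counter to the requested
+	 * stable state, which is the target value we wait for below.
+	 */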
+	if (increment)
+		wal_prohibit_counter =
+			pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey this WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared-memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	WALProhibitState cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+	return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
+			cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it must be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the state transition by setting the requested WAL prohibit state in all
+	 * backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/* Return to "check" state  */
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * No other process performs the final state transition, so the shared
+	 * WAL prohibit state counter should not have changed by now.
+	 */
+	Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+	/* Increment the WAL prohibit state counter in shared memory. */
+	wal_prohibit_counter =
+		pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+
+	/* Should have set counter for the final state */
+	cur_state = GetWALProhibitState(wal_prohibit_counter);
+	Assert(cur_state == WALPROHIBIT_STATE_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_READ_WRITE);
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("system is now read only")));
+	else
+		ereport(LOG, (errmsg("system is now read write")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process, which has to make sure it
+	 * processes all pending WAL prohibit state change requests as soon as
+	 * possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes run
+	 * in non-checkpointer processes, do nothing if we are not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state then
+				 * the startup process skipped the WAL writes required to
+				 * start the server normally; if so, perform them right away.
+				 * While doing that, hold off state transitions to avoid a
+				 * recursive attempt to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e8..6c609e0e4b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because the pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e29..7891c0b962a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -250,9 +251,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -734,6 +736,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState tracks the state of the end-of-recovery
+	 * checkpoint and the other WAL writes required to start the server
+	 * normally.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5233,6 +5241,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6256,6 +6265,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the value of SharedXLogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6466,7 +6485,7 @@ StartupXLOG(void)
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
-	bool		promoted = false;
+	bool		needChkpt = false;
 	struct stat st;
 
 	/*
@@ -6616,13 +6635,21 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a read only
+		 * (wal prohibited) state.
+		 */
+		ControlFile->wal_prohibited = false;
+	}
+
 	/* Set up XLOG reader facility */
 	xlogreader =
 		XLogReaderAllocate(wal_segment_size, NULL, wal_segment_close);
@@ -7877,16 +7904,134 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	needChkpt = InRecovery;
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We always start in recovery here, since shutting down in the WAL
+		 * prohibit state skips the shutdown checkpoint, which forces crash
+		 * recovery on restart.
+		 */
+		Assert(needChkpt);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the system is read only")));
+	}
+	else
+		XLogAcceptWrites(needChkpt, xlogreader, EndOfLog, EndOfLogTLI);
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+}
+
+/*
+ * This is the tail end of StartupXLOG(), performing the WAL writes necessary
+ * before starting the server normally.  These operations are skipped in the
+ * startup process if the system is started in the WAL prohibited state; in
+ * that case the checkpointer performs them while changing the system to the
+ * WAL permitted state.
+ */
+void
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only the startup process, checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, these operations cannot
+	 * be performed.
+	 */
+	Assert(!IsWALProhibited());
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7904,15 +8049,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7943,6 +8093,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -8022,63 +8174,13 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
 	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8107,27 +8209,32 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
 	 */
 	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8342,9 +8449,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8363,9 +8470,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8387,6 +8505,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8676,9 +8800,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is performed during recovery; the shutdown checkpoint
+	 * and xlog rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8691,6 +8819,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the system is read only")));
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 451db2ee0a0..dd1776a1810 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1633,6 +1633,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a799544738e..c7b5758762b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -701,10 +701,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a4594..76affb7b549 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 47847563ef0..3542d8c913e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -226,6 +227,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index eac68951414..77950e04600 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold off WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold off WAL prohibit change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check again, for the same reasons as the WAL prohibit state
+				 * change request checks above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 16c6f17e235..63023de06e3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 89b5b8b7b9d..5e33d878ce9 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -726,6 +726,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d0a51b507d7..f4ed3c1627a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -235,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -643,6 +644,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2142,6 +2144,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL,
+			gettext_noop("Shows whether the system is read only."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12566,4 +12580,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (!XLogInsertAllowed())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..09b54710b60
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,104 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
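+
+/*
+ * For example, counter values 0, 1, 2, 3, 4, 5, ... correspond to the states
+ * READ_WRITE, GOING_READ_ONLY, READ_ONLY, GOING_READ_WRITE, READ_WRITE,
+ * GOING_READ_ONLY, ..., since only the low two bits of the counter encode the
+ * state (see GetWALProhibitState() below).
+ */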
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by pg_prohibit_wal(true), so any
+ * part of the code that can only be reached with an XID assigned is never
+ * reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertions above, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state
+ * to WAL prohibited.  Therefore, we need to explicitly error out before
+ * entering the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f542af0a262..5247d225b84 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -168,6 +168,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of work that enables wal writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -316,6 +324,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -324,6 +333,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -335,6 +345,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern void XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* wal_prohibited determines whether WAL inserts are allowed or not. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f4957653ae6..098c2bbdcd7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11560,6 +11560,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 47accc5ffe2..624ee55ff3e 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -224,7 +224,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff677d4b..fa39d2e690d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2708,6 +2708,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

Attachment: v26-0003-Documentation.patch (application/octet-stream)
From 2724f096d343a56777b7380fe1789f1a4103ff0e Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v26 3/3] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d2011634075..fa185b3d750 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24898,9 +24898,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25034,6 +25034,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         is emitted and <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to the
+        read-only (WAL prohibited) state, if it is not already read-only.
+        When <literal>false</literal> is passed, the system is changed to
+        the read-write (WAL permitted) state, if it is not already read-write.
+        See <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..b54767beeb0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which inserting write-ahead log records is prohibited,
+    until the same function is executed to change the state back to
+    read-write. As in Hot Standby, connections to the server are allowed to
+    run read-only queries while in the WAL prohibited state.  While the system
+    only allows read-only queries, the <literal>wal_prohibited</literal> GUC
+    shows <literal>on</literal>; otherwise it shows <literal>off</literal>.
+    When the user requests the WAL prohibited state, any session that is
+    already running a transaction which has performed, or may still perform,
+    WAL write operations is terminated. This is useful for HA setups where the
+    master server needs to stop accepting WAL writes immediately and kick out
+    any transaction expecting to write WAL at the end, for example when the
+    network is down on the master or replication connections have failed.
+   </para>
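+
+   <para>
+    For example, to force the system into the read-only state and later make
+    it read-write again:
+<programlisting>
+SELECT pg_prohibit_wal(true);
+SHOW wal_prohibited;
+SELECT pg_prohibit_wal(false);
+</programlisting>
+   </para>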
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; on
+    restart, the server goes through crash recovery and remains in that state
+    until the system is changed back to read-write. When starting a read-only
+    server, if a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file is found, the system implicitly
+    leaves the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because it is still in recovery or because it
+was forced into the WAL prohibited state by executing the pg_prohibit_wal()
+function.  We have a lower-level defense in XLogBeginInsert() and elsewhere to
+stop us from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error, because any error there will escalate to PANIC, as mentioned
+previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend which receives the read-only system state transition barrier
+interrupt needs to stop WAL writing immediately.  While absorbing the barrier,
+a backend kills its running transaction if it has a valid XID, since a valid
+XID indicates that the transaction has performed, or is planning to perform, a
+WAL write.  A transaction which has not yet acquired a valid XID, or an
+operation such as VACUUM or concurrent CREATE INDEX which does not necessarily
+have a valid XID when writing WAL, is not stopped by barrier processing, and
+might hit the error from XLogBeginInsert() while trying to write WAL in the
+read-only system state.  To prevent such an error from being raised by
+XLogBeginInsert() inside a critical section, WAL write permission has to be
+checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assert-only flag that indicates
+whether permission has been checked before calling XLogBeginInsert().  If it
+has not, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is not inside a critical section, where throwing
+an error is acceptable.  To set the permission-check flag, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for placing the permission check
+routines are as follows (see the sketch after this list):
+
+	Places where a WAL write inside a critical section can be reached without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can be checked using AssertWALPermitted_HaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but that need to ensure the permission has been
+	checked on assert-enabled builds, should use AssertWALPermitted().
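+
+A typical call site therefore looks roughly like this (a minimal sketch, not
+taken verbatim from any particular caller):
+
+	CheckWALPermitted();		/* or one of the Assert variants above */
+
+	START_CRIT_SECTION();
+	/* ... apply changes to the shared buffer(s) ... */
+	XLogBeginInsert();
+	/* ... XLogRegisterBuffer()/XLogRegisterData(), then XLogInsert() ... */
+	END_CRIT_SECTION();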
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibit state), so we simply skip dirtying
+blocks because of hints while in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

Attachment: v26-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/octet-stream)
From ac200924dd8090c727286b5745dec351a749631f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v26 2/3] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an ERROR or an Assert, chosen by the following criteria, before critical
sections that write WAL while the system is WAL-prohibited:

 - Added an ERROR for functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, added the common static
   inline function CheckWALPermitted().
 - Added an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that, added
   AssertWALPermittedHaveXID().

To enforce the rule that one of the aforesaid error or assert checks is done
before entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added.  If the check is missing, XLogBeginInsert()
fails an assertion when called inside a critical section.

If the WAL insert is not inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() itself to perform the check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 +++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 +++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++++-
 src/backend/access/gist/gist.c            | 25 +++++++++++++----
 src/backend/access/gist/gistvacuum.c      | 13 +++++++--
 src/backend/access/hash/hash.c            | 19 +++++++++++--
 src/backend/access/hash/hashinsert.c      |  9 +++++-
 src/backend/access/hash/hashovfl.c        | 22 ++++++++++++---
 src/backend/access/hash/hashpage.c        |  9 ++++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++++--
 src/backend/access/heap/vacuumlazy.c      | 22 ++++++++++++---
 src/backend/access/heap/visibilitymap.c   | 22 +++++++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++++++--
 src/backend/access/transam/multixact.c    |  5 +++-
 src/backend/access/transam/twophase.c     |  9 ++++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 +++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 28 +++++++++++++------
 src/backend/access/transam/xloginsert.c   | 13 +++++++--
 src/backend/commands/sequence.c           | 16 +++++++++++
 src/backend/commands/variable.c           |  9 ++++--
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 ++++---
 src/backend/storage/freespace/freespace.c | 10 ++++++-
 src/backend/storage/lmgr/lock.c           |  6 ++--
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          |  6 ++--
 src/include/miscadmin.h                   | 27 ++++++++++++++++++
 41 files changed, 466 insertions(+), 74 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index f885f3ab3af..ffcb5d98747 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* An index build will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 03d4abc938b..d8fd499f072 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2128,6 +2129,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2451,6 +2454,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -3011,6 +3016,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3762,6 +3769,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3935,6 +3944,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4867,6 +4878,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5657,6 +5670,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5815,6 +5830,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5923,6 +5940,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6043,6 +6062,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6073,6 +6093,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6083,7 +6107,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 0c8e49d3e6c..bd2cf50f20a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f592e71364d..cdfb81d0363 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1292,6 +1293,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1307,8 +1313,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1913,8 +1918,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1939,7 +1949,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2364,6 +2374,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2374,6 +2385,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2404,7 +2418,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibit state, so
+	 * when we get here in the startup process we only assert, rather than
+	 * check, the WAL write permission.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ef48679cc2e..41d044ec3cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f06..df9506b7cd5 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1130,6 +1135,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1538,6 +1545,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1624,6 +1633,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1809,6 +1820,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..1e622d9fefd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2948,7 +2951,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b6581349a35..d3400352229 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2210,6 +2213,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2308,6 +2314,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 7128f2da8e7..ad5e4ae703c 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6c609e0e4b4..b0363e82c77 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7891c0b962a..6ffc1b21758 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1031,7 +1031,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2896,9 +2896,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state must not block WAL flushing: a dirty buffer cannot be
+	 * evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9073,6 +9075,8 @@ CreateCheckPoint(int flags)
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
+	if (!RecoveryInProgress() && !XLogInsertAllowed())
+		elog(ERROR, "can't create a checkpoint while system is read only");
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
@@ -9324,6 +9328,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9481,6 +9488,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10133,7 +10142,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10147,10 +10156,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10172,8 +10181,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical section.
+	 * Otherwise, a WAL prohibited error raised inside it would force a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b4237..aa1b0c12000 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76affb7b549..ec05110b571 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0c5b87864b9..8335382870f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3904,13 +3904,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index cfa0414e5ab..15e1e24deeb 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -286,12 +287,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -306,7 +314,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..a615b37b4ee 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while system is read only",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects while system is read only.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 09b54710b60..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -70,9 +70,9 @@ AssertWALPermitted(void)
 }
 
 /*
- * XID-bearing transactions are killed off by "ALTER SYSTEM READ ONLY", so any
- * part of the code that can only be reached with an XID assigned is never
- * reached when WAL is prohibited.
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
  */
 static inline void
 AssertWALPermittedHaveXID(void)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af5..f18232cbf53 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -97,12 +97,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -112,6 +137,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -138,6 +164,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

#109Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#108)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Apr 12, 2021 at 10:04 AM Amul Sul <sulamul@gmail.com> wrote:

Rebased again.

I started to look at this today, and didn't get very far, but I have a
few comments. The main one is that I don't think this patch implements
the design proposed in
/messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com

The first part of that proposal said this:

"1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with."

As I read it, the patch clears the read-only state in memory, does not
clear it in the control file, and does not log a message.

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments."

Now you moved that code, but you also moved (6)
CompleteCommitTsInitialization(), (7) setting the control file to
DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
(9) requesting a checkpoint if we were just promoted. That's not what
was proposed. One result of this is that the server now thinks it's in
recovery even after the startup process has exited.
RecoveryInProgress() is still returning true everywhere. But that is
inconsistent with what Andres and I were recommending in
/messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
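
For reference, as I read that proposal, the new function would end up with
roughly this shape (only a sketch; the record/checkpoint choice and the
recovery_end_command step are left as comments rather than pretending to be
the actual patch code):

/* Sketch only, not the patch's code. */
static void
XLogAcceptWrites(void)
{
	/* (1) re-log full_page_writes if it changed while WAL couldn't be written */
	UpdateFullPageWrites();

	/*
	 * (2) CreateEndOfRecoveryRecord(), RequestCheckpoint(), or
	 *     CreateCheckPoint(), depending on how we got here, as StartupXLOG
	 *     does today
	 */

	/* (3) run recovery_end_command, as StartupXLOG does today */

	/* (4) XLOG_PARAMETER_CHANGE record, if any relevant settings changed */
	XLogReportParameters();

	/* (5) finish commit timestamp initialization now that WAL can be written */
	CompleteCommitTsInitialization();
}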

I also noticed that 0001 does not compile without 0002, so the
separation into multiple patches is not clean. I would actually
suggest that the first patch in the series should just create
XLogAcceptWrites() with the minimum amount of adjustment to make that
work. That would potentially let us commit that change independently,
which would be good, because then if we accidentally break something,
it'll be easier to pin down to that particular change instead of being
mixed with everything else this needs to change.

--
Robert Haas
EDB: http://www.enterprisedb.com

#110Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#109)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, May 7, 2021 at 1:23 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 12, 2021 at 10:04 AM Amul Sul <sulamul@gmail.com> wrote:

Rebased again.

I started to look at this today, and didn't get very far, but I have a
few comments. The main one is that I don't think this patch implements
the design proposed in
/messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com

The first part of that proposal said this:

"1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with."

As I read it, the patch clears the read-only state in memory, does not
clear it in the control file, and does not log a message.

The state in the control file also gets cleared. Though, after
clearing the in-memory state, the patch doesn't immediately write the
change to the control file; it relies on the next UpdateControlFile()
to do that.

Regarding the log message, I think I skipped that intentionally, to
avoid a confusing log entry like "system is now read write" when we start
as a hot standby, which is not really read-write.

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments."

Now you moved that code, but you also moved (6)
CompleteCommitTsInitialization(), (7) setting the control file to
DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
(9) requesting a checkpoint if we were just promoted. That's not what
was proposed. One result of this is that the server now thinks it's in
recovery even after the startup process has exited.
RecoveryInProgress() is still returning true everywhere. But that is
inconsistent with what Andres and I were recommending in
/messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

Regarding the modified approach, I tried to explain why I did
this in /messages/by-id/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

I also noticed that 0001 does not compile without 0002, so the
separation into multiple patches is not clean. I would actually
suggest that the first patch in the series should just create
XLogAcceptWrites() with the minimum amount of adjustment to make that
work. That would potentially let us commit that change independently,
which would be good, because then if we accidentally break something,
it'll be easier to pin down to that particular change instead of being
mixed with everything else this needs to change.

Ok, I will try in the next version.

Regards,
Amul

#111Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#110)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sun, May 9, 2021 at 1:26 AM Amul Sul <sulamul@gmail.com> wrote:

The state in the control file also gets cleared. Though, after
clearing the in-memory state, the patch doesn't immediately write the
change to the control file; it relies on the next UpdateControlFile()
to do that.

But when will that happen? If you're relying on some very nearby code,
that might be OK, but perhaps a comment is in order. If you're just
thinking it's going to happen eventually, I think that's not good
enough.

Regarding the log message, I think I skipped that intentionally, to
avoid a confusing log entry like "system is now read write" when we start
as a hot standby, which is not really read-write.

I think the message should not be phrased that way. In fact, I think
now that we've moved to calling this pg_prohibit_wal() rather than
ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
maybe some comments and function names as well. Perhaps something
like:

system is read only -> WAL is now prohibited
system is read write -> WAL is no longer prohibited

And then for this particular case, maybe something like:

clearing WAL prohibition because the system is in archive recovery

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments."

Now you moved that code, but you also moved (6)
CompleteCommitTsInitialization(), (7) setting the control file to
DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
(9) requesting a checkpoint if we were just promoted. That's not what
was proposed. One result of this is that the server now thinks it's in
recovery even after the startup process has exited.
RecoveryInProgress() is still returning true everywhere. But that is
inconsistent with what Andres and I were recommending in
/messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

Regarding the modified approach, I tried to explain why I did
this in /messages/by-id/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

I am not able to understand what problem you are seeing there. If
we're in crash recovery, then nobody can connect to the database, so
there can't be any concurrent activity. If we're in archive recovery,
we now clear the WAL-is-prohibited flag so that we will go read-write
directly at the end of recovery. We can and should refuse any effort
to call pg_prohibit_wal() during recovery. If we reached the end of
crash recovery and are now permitting read-only connections, why would
anyone be able to write WAL before the system has been changed to
read-write? If that can happen, it's a bug, not a reason to change the
design.

Maybe your concern here is about ordering: the process that is going
to run XLogAcceptWrites() needs to allow xlog writes locally before we
tell other backends that they also can xlog writes; otherwise, some
other records could slip in before UpdateFullPageWrites() and similar
have run, which we probably don't want. But that's why
LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
what we need in this situation, we should be able to tweak it so it
does.
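
Roughly, the ordering I have in mind for the checkpointer is something like
this (just a sketch; SetWALProhibitState() and the barrier type name below
are placeholders for whatever the patch actually uses):

	uint64		barrier_gen;

	LocalSetXLogInsertAllowed();	/* this one process may write WAL now */
	XLogAcceptWrites();				/* UpdateFullPageWrites() etc. get logged first */
	SetWALProhibitState(WALPROHIBIT_STATE_READ_WRITE);	/* placeholder setter */

	/* only now tell everyone else that WAL is no longer prohibited */
	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
	WaitForProcSignalBarrier(barrier_gen);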

If your concern is something else, can you spell it out for me again
because I'm not getting it?

--
Robert Haas
EDB: http://www.enterprisedb.com

#112Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#111)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, May 10, 2021 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, May 9, 2021 at 1:26 AM Amul Sul <sulamul@gmail.com> wrote:

The state in the control file also gets cleared. Though, after
clearing the in-memory state, the patch doesn't immediately write the
change to the control file; it relies on the next UpdateControlFile()
to do that.

But when will that happen? If you're relying on some very nearby code,
that might be OK, but perhaps a comment is in order. If you're just
thinking it's going to happen eventually, I think that's not good
enough.

Ok.

Regarding the log message, I think I skipped that intentionally, to
avoid a confusing log entry like "system is now read write" when we start
as a hot standby, which is not really read-write.

I think the message should not be phrased that way. In fact, I think
now that we've moved to calling this pg_prohibit_wal() rather than
ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
maybe some comments and function names as well. Perhaps something
like:

system is read only -> WAL is now prohibited
system is read write -> WAL is no longer prohibited

And then for this particular case, maybe something like:

clearing WAL prohibition because the system is in archive recovery

Ok, thanks for the suggestions.

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments."

Now you moved that code, but you also moved (6)
CompleteCommitTsInitialization(), (7) setting the control file to
DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
(9) requesting a checkpoint if we were just promoted. That's not what
was proposed. One result of this is that the server now thinks it's in
recovery even after the startup process has exited.
RecoveryInProgress() is still returning true everywhere. But that is
inconsistent with what Andres and I were recommending in
/messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

Regarding the modified approach, I tried to explain why I did
this in /messages/by-id/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

I am not able to understand what problem you are seeing there. If
we're in crash recovery, then nobody can connect to the database, so
there can't be any concurrent activity. If we're in archive recovery,
we now clear the WAL-is-prohibited flag so that we will go read-write
directly at the end of recovery. We can and should refuse any effort
to call pg_prohibit_wal() during recovery. If we reached the end of
crash recovery and are now permitting read-only connections, why would
anyone be able to write WAL before the system has been changed to
read-write? If that can happen, it's a bug, not a reason to change the
design.

Maybe your concern here is about ordering: the process that is going
to run XLogAcceptWrites() needs to allow xlog writes locally before we
tell other backends that they also can xlog writes; otherwise, some
other records could slip in before UpdateFullPageWrites() and similar
have run, which we probably don't want. But that's why
LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
what we need in this situation, we should be able to tweak it so it
does.

Yes, we don't want any write to slip in before UpdateFullPageWrites().
Recently[1], we have decided to let the Checkpointer process call
XLogAcceptWrites() unconditionally.

Here the problem is that when a backend executes the
pg_prohibit_wal(false) function to make the system read-write, the WAL
prohibited state is set to the in-progress one (i.e.
WALPROHIBIT_STATE_GOING_READ_WRITE) and then the Checkpointer is signaled.
Next, the Checkpointer will convey this system change to all existing
backends using a global barrier, and after that the final WAL prohibited
state is set to read-write (i.e. WALPROHIBIT_STATE_READ_WRITE).
While the Checkpointer is in the process of conveying this global
barrier, any new backend can connect at that time and can write a new
record, because the in-progress read-write state is equivalent to the
final read-write state iff LocalXLogInsertAllowed != 0 for that
backend. And that new record could slip in before or in between the
records to be written by XLogAcceptWrites().

1] /messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
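
To put that race on a timeline (simplified, with the current patch's ordering
where XLogAcceptWrites() runs before the barrier):

/*
 * backend A:    pg_prohibit_wal(false)
 *               -> state = WALPROHIBIT_STATE_GOING_READ_WRITE, checkpointer signaled
 * backend B:    connects now; XLogInsertAllowed() already returns true for it
 * backend B:    inserts a WAL record                        <-- slips in here
 * checkpointer: XLogAcceptWrites()  (UpdateFullPageWrites() record, etc.)
 * checkpointer: emits the global barrier, waits for everyone to absorb it
 *               -> state = WALPROHIBIT_STATE_READ_WRITE
 */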

Regards,
Amul

#113Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#112)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, May 10, 2021 at 10:25 PM Amul Sul <sulamul@gmail.com> wrote:

Yes, we don't want any write to slip in before UpdateFullPageWrites().
Recently[1], we have decided to let the Checkpointer process call
XLogAcceptWrites() unconditionally.

Here the problem is that when a backend executes the
pg_prohibit_wal(false) function to make the system read-write, the WAL
prohibited state is set to the in-progress one (i.e.
WALPROHIBIT_STATE_GOING_READ_WRITE) and then the Checkpointer is signaled.
Next, the Checkpointer will convey this system change to all existing
backends using a global barrier, and after that the final WAL prohibited
state is set to read-write (i.e. WALPROHIBIT_STATE_READ_WRITE).
While the Checkpointer is in the process of conveying this global
barrier, any new backend can connect at that time and can write a new
record, because the in-progress read-write state is equivalent to the
final read-write state iff LocalXLogInsertAllowed != 0 for that
backend. And that new record could slip in before or in between the
records to be written by XLogAcceptWrites().

1] /messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
and the checkpointer is signaled, shouldn't the checkpointer first call
XLogAcceptWrites and then inform the other backends through the
global barrier? Are we worried that we might have written the WAL in
XLogAcceptWrites but then fail to set the state to
WALPROHIBIT_STATE_READ_WRITE? Then maybe we can inform all the
backends first, but before setting the state to
WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#114Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#113)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 11:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 10, 2021 at 10:25 PM Amul Sul <sulamul@gmail.com> wrote:

Yes, we don't want any write to slip in before UpdateFullPageWrites().
Recently[1], we have decided to let the Checkpointer process call
XLogAcceptWrites() unconditionally.

Here the problem is that when a backend executes the
pg_prohibit_wal(false) function to make the system read-write, the WAL
prohibited state is set to the in-progress one (i.e.
WALPROHIBIT_STATE_GOING_READ_WRITE) and then the Checkpointer is signaled.
Next, the Checkpointer will convey this system change to all existing
backends using a global barrier, and after that the final WAL prohibited
state is set to read-write (i.e. WALPROHIBIT_STATE_READ_WRITE).
While the Checkpointer is in the process of conveying this global
barrier, any new backend can connect at that time and can write a new
record, because the in-progress read-write state is equivalent to the
final read-write state iff LocalXLogInsertAllowed != 0 for that
backend. And that new record could slip in before or in between the
records to be written by XLogAcceptWrites().

1] /messages/by-id/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
and the checkpointer is signaled, shouldn't the checkpointer first call
XLogAcceptWrites and then inform the other backends through the
global barrier? Are we worried that we might have written the WAL in
XLogAcceptWrites but then fail to set the state to
WALPROHIBIT_STATE_READ_WRITE? Then maybe we can inform all the
backends first, but before setting the state to
WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes. Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites(). Also, it is
possible that a backend could connect while the checkpointer is performing
XLogAcceptWrites() and write a WAL record right then.

So, calling XLogAcceptWrites() before the barrier does not really solve my
concern. Note that in the previous patch version, XLogAcceptWrites() does get
called before the global barrier is emitted.

Please let me know if it is still not clear, thanks.

Regards,
Amul

#115Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#114)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes. Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites().

Okay, make sense now. But my next question is why do we allow backends
to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
wait until the shared memory state is changed to
WALPROHIBIT_STATE_READ_WRITE?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#116Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#115)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes. Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites().

Okay, make sense now. But my next question is why do we allow backends
to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
wait until the shared memory state is changed to
WALPROHIBIT_STATE_READ_WRITE?

Ok, good question.

Now let's first try to understand the Checkpointer's work.

When the Checkpointer sees that the WAL prohibited state is an in-progress
state, it first emits the global barrier and waits until all backends absorb
it. After that, it sets the final requested WAL prohibit state.

When the other backends absorb those barriers, they take the appropriate
action (e.g. abort the read-write transaction if moving to read-only). Also,
the LocalXLogInsertAllowed flag gets reset there, and that backend needs to
call XLogInsertAllowed() to get the right value again, which then decides
whether WAL writes are permitted or prohibited.
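
Roughly, what each backend does when it absorbs the barrier is this
(simplified sketch; the function name and the error text are only
illustrative, not what the patch has):

static void
AbsorbWALProhibitBarrier(bool going_read_only)
{
	/* sessions holding an XID cannot survive a switch to read-only */
	if (going_read_only && TransactionIdIsValid(GetTopTransactionIdIfAny()))
		ereport(FATAL,
				(errmsg("terminating connection because WAL is prohibited")));

	/* forget the cached verdict; the next XLogInsertAllowed() re-checks it */
	LocalXLogInsertAllowed = -1;	/* the flag kept in xlog.c */
}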

Consider an example where the system is trying to change to read-write, and
for that the WAL prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
before the Checkpointer starts its work. If we want to treat the system as
read-only for the WALPROHIBIT_STATE_GOING_READ_WRITE state as well, then we
need to think about the behavior of a backend that has absorbed the barrier
and reset its LocalXLogInsertAllowed flag. That backend is eventually going
to call XLogInsertAllowed() to get the actual value, and on seeing the
current state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set
LocalXLogInsertAllowed again to the same value it had for the read-only state.

Now the question is: when should this value get reset again so that the
backend can be read-write? We are done with the barrier, and that backend is
never going to switch to read-write again.

One solution, I think, is to set the final state before emitting the barrier,
but as per the current design it should get set after all barrier processing.
Let's see what Robert says on this.

Regards,
Amul

#117Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#116)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 3:38 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes. Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites().

Okay, make sense now. But my next question is why do we allow backends
to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
wait until the shared memory state is changed to
WALPROHIBIT_STATE_READ_WRITE?

Ok, good question.

Now let's first try to understand the Checkpointer's work.

When the Checkpointer sees that the WAL prohibited state is an in-progress
state, it first emits the global barrier and waits until all backends absorb
it. After that, it sets the final requested WAL prohibit state.

When the other backends absorb those barriers, they take the appropriate
action (e.g. abort the read-write transaction if moving to read-only). Also,
the LocalXLogInsertAllowed flag gets reset there, and that backend needs to
call XLogInsertAllowed() to get the right value again, which then decides
whether WAL writes are permitted or prohibited.

Consider an example where the system is trying to change to read-write, and
for that the WAL prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
before the Checkpointer starts its work. If we want to treat the system as
read-only for the WALPROHIBIT_STATE_GOING_READ_WRITE state as well, then we
need to think about the behavior of a backend that has absorbed the barrier
and reset its LocalXLogInsertAllowed flag. That backend is eventually going
to call XLogInsertAllowed() to get the actual value, and on seeing the
current state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set
LocalXLogInsertAllowed again to the same value it had for the read-only state.

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

So your problem is that on receiving the barrier we need to call
LocalXLogInsertAllowed() from the backend, but how does that matter?
you can still make IsWALProhibited() return true.

I don't know the complete code so I might be missing something but at
least that is what I would expect from the design POV.

Other than this point, I think the state names READ_ONLY, READ_WRITE
are a bit confusing no? because actually, these states represent
whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
are putting the system under a Read-only state. For example, if you
are doing some write operation on an unlogged table will be allowed, I
guess because that will not generate the WAL until you commit (because
commit generates WAL) right? so practically, we are just blocking the
WAL, not the write operation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#118Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#117)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 3:38 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes. Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites().

Okay, make sense now. But my next question is why do we allow backends
to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
wait until the shared memory state is changed to
WALPROHIBIT_STATE_READ_WRITE?

Ok, good question.

Now let's first try to understand the Checkpointer's work.

When the Checkpointer sees that the WAL prohibited state is an in-progress
state, it first emits the global barrier and waits until all backends absorb
it. After that, it sets the final requested WAL prohibit state.

When the other backends absorb those barriers, they take the appropriate
action (e.g. abort the read-write transaction if moving to read-only). Also,
the LocalXLogInsertAllowed flag gets reset there, and that backend needs to
call XLogInsertAllowed() to get the right value again, which then decides
whether WAL writes are permitted or prohibited.

Consider an example where the system is trying to change to read-write, and
for that the WAL prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
before the Checkpointer starts its work. If we want to treat the system as
read-only for the WALPROHIBIT_STATE_GOING_READ_WRITE state as well, then we
need to think about the behavior of a backend that has absorbed the barrier
and reset its LocalXLogInsertAllowed flag. That backend is eventually going
to call XLogInsertAllowed() to get the actual value, and on seeing the
current state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set
LocalXLogInsertAllowed again to the same value it had for the read-only state.

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

Yes, that's what the current patch is doing from the first patch version.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

So your problem is that on receiving the barrier we need to call
LocalXLogInsertAllowed() from the backend, but how does that matter?
you can still make IsWALProhibited() return true.

Note that LocalXLogInsertAllowed is a per-backend local flag, not a
function, and everywhere in the server code we don't rely on
IsWALProhibited(); instead we rely on the LocalXLogInsertAllowed
flag before WAL writes, and that check is made via XLogInsertAllowed().

I don't know the complete code so I might be missing something but at
least that is what I would expect from the design POV.

Other than this point, I think the state names READ_ONLY, READ_WRITE
are a bit confusing no? because actually, these states represent
whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
are putting the system under a Read-only state. For example, if you
are doing some write operation on an unlogged table will be allowed, I
guess because that will not generate the WAL until you commit (because
commit generates WAL) right? so practically, we are just blocking the
WAL, not the write operation.

These read-only and read-write names are the WAL prohibited states, though
in the discussion we are using them for the read-only/read-write system, and
the complete macro names are WALPROHIBIT_STATE_READ_ONLY and
WALPROHIBIT_STATE_READ_WRITE. I am not sure why that would make the
implementation confusing.

Regards,
Amul

#119Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#118)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

Yes, that's what the current patch is doing from the first patch version.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

Okay, but while browsing the code I do not see any direct if condition
based on the "LocalXLogInsertAllowed" variable, can you point me to
some references?
I only see one if check on this variable and that is in
XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
you are already checking IsWALProhibited. No?

Other than this point, I think the state names READ_ONLY, READ_WRITE
are a bit confusing no? because actually, these states represent
whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
are putting the system under a Read-only state. For example, if you
are doing some write operation on an unlogged table will be allowed, I
guess because that will not generate the WAL until you commit (because
commit generates WAL) right? so practically, we are just blocking the
WAL, not the write operation.

These read-only and read-write names are the WAL prohibited states, though
in the discussion we are using them for the read-only/read-write system, and
the complete macro names are WALPROHIBIT_STATE_READ_ONLY and
WALPROHIBIT_STATE_READ_WRITE. I am not sure why that would make the
implementation confusing.

Fine, I am not too particular about these names.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#120Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#119)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

Yes, that's what the current patch is doing from the first patch version.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

Okay, but while browsing the code I do not see any direct if condition
based on the "LocalXLogInsertAllowed" variable, can you point me to
some references?
I only see one if check on this variable and that is in
XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
you are already checking IsWALProhibited. No?

I am not sure I understood this. Where am I checking IsWALProhibited()?

IsWALProhibited() is called by XLogInsertAllowed() once when
LocalXLogInsertAllowed is in a reset state, and that result will be
cached in LocalXLogInsertAllowed and will be used in the subsequent
XLogInsertAllowed() call.
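
In other words, the check is roughly this (a simplified sketch; the real
function in the patch also has to deal with recovery):

bool
XLogInsertAllowed(void)
{
	/* 1 = allowed, 0 = prohibited, -1 = not decided yet for this backend */
	if (LocalXLogInsertAllowed >= 0)
		return LocalXLogInsertAllowed != 0;

	/* no cached verdict yet: consult the shared WAL prohibit state once */
	LocalXLogInsertAllowed = IsWALProhibited() ? 0 : 1;

	return LocalXLogInsertAllowed != 0;
}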

Regards,
Amul

#121Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#120)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 6:56 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

Yes, that's what the current patch is doing from the first patch version.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

Okay, but while browsing the code I do not see any direct if condition
based on the "LocalXLogInsertAllowed" variable, can you point me to
some references?
I only see one if check on this variable and that is in
XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
you are already checking IsWALProhibited. No?

I am not sure I understood this. Where am I checking IsWALProhibited()?

IsWALProhibited() is called by XLogInsertAllowed() once when
LocalXLogInsertAllowed is in a reset state, and that result will be
cached in LocalXLogInsertAllowed and will be used in the subsequent
XLogInsertAllowed() call.

Okay, got what you were trying to say. But that is easily fixable: I
mean, if the state is WALPROHIBIT_STATE_GOING_READ_WRITE, then what we
can do is not allow WAL to be written, but also not set
LocalXLogInsertAllowed to 0. So until we are in the intermediate
state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
on GetWALProhibitState(). I know this will add a performance penalty,
but only for the short period during which we are in the intermediate
state. After that, as soon as it is set to
WALPROHIBIT_STATE_READ_WRITE, XLogInsertAllowed() will set
LocalXLogInsertAllowed to 1.
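
i.e. something along these lines in the not-yet-cached branch of
XLogInsertAllowed() (just a sketch of the idea, not tested code; I'm assuming
GetWALProhibitState() hands back the state enum directly):

	int			state = GetWALProhibitState();

	if (state == WALPROHIBIT_STATE_GOING_READ_WRITE)
		return false;			/* refuse WAL writes, but leave the cache unset */

	if (state == WALPROHIBIT_STATE_READ_WRITE)
	{
		LocalXLogInsertAllowed = 1;	/* final state, safe to cache */
		return true;
	}

	/* read-only, or going read-only */
	LocalXLogInsertAllowed = 0;
	return false;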

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#122Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#121)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, 11 May 2021 at 7:50 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 6:56 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

Yes, that's what the current patch is doing from the first patch version.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

Okay, but while browsing the code I do not see any direct if condition
based on the "LocalXLogInsertAllowed" variable, can you point me to
some references?
I only see one if check on this variable and that is in
XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
you are already checking IsWALProhibited. No?

I am not sure I understood this. Where am I checking IsWALProhibited()?

IsWALProhibited() is called by XLogInsertAllowed() once when
LocalXLogInsertAllowed is in a reset state, and that result will be
cached in LocalXLogInsertAllowed and will be used in the subsequent
XLogInsertAllowed() call.

Okay, got what you were trying to say. But that is easily fixable: I
mean, if the state is WALPROHIBIT_STATE_GOING_READ_WRITE, then what we
can do is not allow WAL to be written, but also not set
LocalXLogInsertAllowed to 0. So until we are in the intermediate
state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
on GetWALProhibitState(). I know this will add a performance penalty,
but only for the short period during which we are in the intermediate
state. After that, as soon as it is set to
WALPROHIBIT_STATE_READ_WRITE, XLogInsertAllowed() will set
LocalXLogInsertAllowed to 1.

I think I have a much easier solution than this; I will post it with an
updated version of the patch set tomorrow.

Regards,
Amul

#123Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#122)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:

I think I have a much easier solution than this; I will post it with an updated version of the patch set tomorrow.

I don't know what you have in mind, but based on this discussion, it
seems to me that we should just have 5 states instead of 4:

1. WAL is permitted.
2. WAL is being prohibited but some backends may not know about the change yet.
3. WAL is prohibited.
4. WAL is in the process of being permitted but XLogAcceptWrites() may
not have been called yet.
5. WAL is in the process of being permitted and XLogAcceptWrites() has
been called but some backends may not know about the change yet.

If we're in state #3 and someone does pg_prohibit_wal(false) then we
enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
state #5, and pushes out a barrier. Then it waits for the barrier to
be absorbed and, when it has been, it moves us to state #1. Then if
someone does pg_prohibit_wal(true) we move to state #2. The
checkpointer pushes out a barrier and waits for it to be absorbed.
Then it calls XLogFlush() and afterward moves us to state #3.

We can have any (reasonable) number of states that we want. There's
nothing magical about 4.
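
Purely for illustration, those five states could be written down as an
enum; the names below are placeholders, not a proposal for the renaming:

    typedef enum WalWriteState
    {
        WAL_PERMITTED,                  /* 1. WAL is permitted */
        WAL_GOING_PROHIBITED,           /* 2. prohibiting; some backends may not know yet */
        WAL_PROHIBITED,                 /* 3. WAL is prohibited */
        WAL_GOING_PERMITTED,            /* 4. permitting; XLogAcceptWrites() not called yet */
        WAL_GOING_PERMITTED_WRITES_DONE /* 5. XLogAcceptWrites() done; barrier not absorbed yet */
    } WalWriteState;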

I also entirely agree with Dilip that we should do some renaming to
get rid of the read-write/read-only terminology, now that this is no
longer part of the syntax. In fact I made the exact same point in my
last review. The WALPROHIBIT_STATE_* constants are just one thing of
many that needs to be included in that renaming.

--
Robert Haas
EDB: http://www.enterprisedb.com

#124Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#123)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, May 11, 2021 at 11:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:

I think I have a much easier solution than this; I will post it with an updated version of the patch set tomorrow.

I don't know what you have in mind, but based on this discussion, it
seems to me that we should just have 5 states instead of 4:

1. WAL is permitted.
2. WAL is being prohibited but some backends may not know about the change yet.
3. WAL is prohibited.
4. WAL is in the process of being permitted but XLogAcceptWrites() may
not have been called yet.
5. WAL is in the process of being permitted and XLogAcceptWrites() has
been called but some backends may not know about the change yet.

If we're in state #3 and someone does pg_prohibit_wal(false) then we
enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
state #5, and pushes out a barrier. Then it waits for the barrier to
be absorbed and, when it has been, it moves us to state #1. Then if
someone does pg_prohibit_wal(true) we move to state #2. The
checkpointer pushes out a barrier and waits for it to be absorbed.
Then it calls XLogFlush() and afterward moves us to state #3.

We can have any (reasonable) number of states that we want. There's
nothing magical about 4.

Your idea makes sense, but IMHO, if we first call XLogAcceptWrites()
and then push out the barrier, then I don't understand the point of
having state #4.  Whenever any backend receives the barrier, the
system will always be in state #5, so what do we want to do with
state #4?

Is it just to make the state machine cleaner?  I mean, in the
checkpointer process we don't need separate "if" checks for whether
XLogAcceptWrites() has been called or not; instead we can just rely on
the state, and if it is #4 then we have to call XLogAcceptWrites().
If so, then I think it's okay to have an additional state; I just
wanted to know what idea you had in mind.
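
To make the question concrete, the kind of checkpointer code I have in
mind is sketched below, reusing the placeholder state names from the
earlier sketch; GetWalWriteState(), AdvanceWalWriteState(), and
EmitAndAbsorbBarrier() are hypothetical helpers, not functions in the
patch:

    /* Sketch only: rely on the shared state rather than a separate flag. */
    static void
    CompleteWalPermitTransition(void)
    {
        switch (GetWalWriteState())
        {
            case WAL_GOING_PERMITTED:
                /* State #4: perform the WAL writes that were skipped at startup. */
                XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
                AdvanceWalWriteState();     /* -> state #5 */
                /* fall through */

            case WAL_GOING_PERMITTED_WRITES_DONE:
                /* State #5: make every backend adopt the new state. */
                EmitAndAbsorbBarrier();
                AdvanceWalWriteState();     /* -> state #1, WAL_PERMITTED */
                break;

            default:
                /* Nothing to do here for the other states. */
                break;
        }
    }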

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#125Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#124)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, May 12, 2021 at 11:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 11, 2021 at 11:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:

I think I have a much easier solution than this; I will post it with an updated version of the patch set tomorrow.

I don't know what you have in mind, but based on this discussion, it
seems to me that we should just have 5 states instead of 4:

I had two different ideas.  The first one is somewhat aligned with the
approach you mentioned below, but without introducing a new state.
Basically, what we want is to prevent any backend that connects to the
server from writing a WAL record while we are doing
XLogAcceptWrites().  We already have a flag recording whether
XLogAcceptWrites() was skipped; when that flag is set (i.e.
XLogAcceptWrites() was skipped previously), treat the system as
read-only (i.e. WAL prohibited) until XLogAcceptWrites() finishes.  In
that case, our IsWALProhibited() function will be:

bool
IsWALProhibited(void)
{
    WALProhibitState cur_state;

    /*
     * If the essential operations needed to enable WAL writes were skipped
     * previously, then treat this state as WAL prohibited until they get
     * done.
     */
    if (unlikely(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_SKIPPED))
        return true;

    cur_state = GetWALProhibitState(GetWALProhibitCounter());

    return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
            cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
}

The other idea, which I want to propose and have implemented in the
attached version, is making IsWALProhibited() something like this:

bool
IsWALProhibited(void)
{
    /* Any state other than read-write is considered read-only */
    return (GetWALProhibitState(GetWALProhibitCounter()) !=
            WALPROHIBIT_STATE_READ_WRITE);
}

But this needs some additional changes to the CompleteWALProhibitChange()
function, where the final in-memory system state update now happens
either before or after emitting the global barrier.

When the in-memory WAL prohibit state is _GOING_READ_WRITE, the
in-memory state immediately changes to _READ_WRITE.  After that, the
global barrier is emitted so that other backends change their local
state.  This should be harmless because a _READ_WRITE system can have
both _READ_ONLY and _READ_WRITE backends.

But when the in-memory WAL prohibit state is _GOING_READ_ONLY, the
in-memory update to the final state does not happen before the global
barrier.  We cannot say the system is _READ_ONLY until we ensure that
all backends are _READ_ONLY.

For more details, please have a look at CompleteWALProhibitChange().
Note that XLogAcceptWrites() happens before
CompleteWALProhibitChange(), so any backend that connects while
XLogAcceptWrites() is in progress will not be allowed to write WAL
until it finishes and CompleteWALProhibitChange() has executed.
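
Condensed, the ordering inside CompleteWALProhibitChange() amounts to
the following (a sketch only, with the control-file update, logging,
and assertions omitted; see the attached patch for the real function):

    if (!wal_prohibited)                        /* _GOING_READ_WRITE */
        AdvanceWALProhibitStateCounter();       /* -> _READ_WRITE right away */

    barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
    WaitForProcSignalBarrier(barrier_gen);      /* every backend has absorbed it */

    if (wal_prohibited)                         /* _GOING_READ_ONLY */
        AdvanceWALProhibitStateCounter();       /* -> _READ_ONLY only now */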

The second approach is much better, IMO, because it keeps
IsWALProhibited() much lighter, and that function runs a number of
times whenever a new backend connects and/or its cached
LocalXLogInsertAllowed value gets reset.  Perhaps you could argue that
the number of calls might not be that high thanks to the locally cached
value in LocalXLogInsertAllowed, but I am in favour of having less work.

Apart from this, I made a separate patch for XLogAcceptWrites()
refactoring. Now, each patch can be compiled without having the next
patch on top of it.

1. WAL is permitted.
2. WAL is being prohibited but some backends may not know about the change yet.
3. WAL is prohibited.
4. WAL is in the process of being permitted but XLogAcceptWrites() may
not have been called yet.
5. WAL is in the process of being permitted and XLogAcceptWrites() has
been called but some backends may not know about the change yet.

If we're in state #3 and someone does pg_prohibit_wal(false) then we
enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
state #5, and pushes out a barrier. Then it waits for the barrier to
be absorbed and, when it has been, it moves us to state #1. Then if
someone does pg_prohibit_wal(true) we move to state #2. The
checkpointer pushes out a barrier and waits for it to be absorbed.
Then it calls XLogFlush() and afterward moves us to state #3.

We can have any (reasonable) number of states that we want. There's
nothing magical about 4.

Your idea makes sense, but IMHO, if we first call XLogAcceptWrites()
and then push out the barrier, then I don't understand the point of
having state #4.  Whenever any backend receives the barrier, the
system will always be in state #5, so what do we want to do with
state #4?

Is it just to make the state machine cleaner?  I mean, in the
checkpointer process we don't need separate "if" checks for whether
XLogAcceptWrites() has been called or not; instead we can just rely on
the state, and if it is #4 then we have to call XLogAcceptWrites().
If so, then I think it's okay to have an additional state; I just
wanted to know what idea you had in mind.

AFAICU, the proposed state #4 is there to restrict newly connected
backends from WAL writes.  My first approach does the same by changing
IsWALProhibited() a bit.

Regards,
Amul

Attachments:

v27-0004-Documentation.patchapplication/x-patch; name=v27-0004-Documentation.patchDownload
From 09d8f61e2f60aa4b596b3ff05dee32903a995fea Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v27 4/4] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d1f1794ca3..7e5e12c5aed 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24882,9 +24882,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25018,6 +25018,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         is emitted and <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument to alter the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system state is changed to
+        read-only (WAL prohibited state), if it is not already. When
+        <literal>false</literal> is passed, the system state is changed to
+        read-write (WAL permitted state), if it is not already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..d761c7c1cad 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode where inserting write-ahead log is prohibited until the
+    same function is executed to change that state back to read-write.  As in
+    Hot Standby, connections to the server are allowed to run read-only queries
+    in the WAL prohibited state.  If the system is in the WAL prohibited state
+    then the GUC <literal>wal_prohibited</literal> will report
+    <literal>on</literal>; otherwise, it will report <literal>off</literal>.
+    When the WAL prohibited state is requested, any existing session that is
+    running a transaction which has already performed, or is planning to
+    perform, WAL write operations will be terminated.  This is useful for HA
+    setups where the master server needs to stop accepting WAL writes
+    immediately and kick out any transaction expecting WAL writes at the end,
+    in case the network goes down on the master or replication connections
+    fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint; at restart,
+    the server goes into crash recovery and stays in that state until the
+    system is changed back to read-write.  If a read-only server finds a
+    <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system implicitly
+    gets out of the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into the WAL prohibited state by pg_prohibit_wal().  We
+have a lower-level defense in XLogBeginInsert() and elsewhere to stop us from
+modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is called
+inside a critical section we must not depend on it to report an error, because
+that would cause a PANIC as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt
+needs to stop writing WAL immediately.  To absorb the barrier, the backend
+kills any running transaction that has a valid XID, since a valid XID indicates
+that the transaction has performed, or is planning, a WAL write.  A transaction
+that has not yet acquired a valid XID, or an operation such as VACUUM or
+CREATE INDEX CONCURRENTLY that does not necessarily have a valid XID when it
+writes WAL, is not stopped during barrier processing, and might hit the error
+from XLogBeginInsert() while trying to write WAL in the read-only state.  To
+prevent such an error from XLogBeginInsert() inside a critical section, WAL
+write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that indicates
+whether permission has been checked before calling XLogBeginInsert().  If not,
+XLogBeginInsert() will fail an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is not inside a critical section, where throwing
+an error is acceptable.  To set the permission check flag, either
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+on exit from the critical section.  The rules for placing these permission
+check routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	having a valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(),
+	so that the error can be reported before the critical section is entered.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked using AssertWALPermitted_HaveXID(), so that a
+	non-assert build does not have the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but where we want to ensure on an assert-enabled
+	build that the permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v27-0002-Implement-wal-prohibit-state-using-global-barrie.patchapplication/x-patch; name=v27-0002-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From 26f38dc627adb4fb711ea29105ac678274a589d0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v27 2/4] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the current state is marked as
    in-progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the state transition, emits the barrier
    request and then acknowledges back to the backend that requested the
    state change once the transition has been completed.  The final state
    is updated in the control file to make it persistent across system
    restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in
    a transaction and that transaction has already been assigned an XID,
    then the backend will be killed by throwing FATAL (XXX: need more
    discussion on this).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the WAL prohibited state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. Autovacuum launcher as well as checkpointer will not do anything in
    WAL-Prohibited server state until someone wakes us up.  E.g. a backend
    might later on request us to put the system back where WAL is no longer
    prohibited.

 6. At shutdown in WAL-Prohibited mode, we skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery,
    but the end-of-recovery checkpoint and the WAL writes necessary to
    start the server normally will be skipped; they are performed once
    the system is changed so that WAL is no longer prohibited.

 7. Altering WAL-Prohibited mode is restricted on a standby server.

 8. The presence of a standby.signal and/or recovery.signal file will
    implicitly and permanently pull the server out of the WAL prohibited state.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on" when
    the system is WAL prohibited.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 480 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 182 +++++++--
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  58 +++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 855 insertions(+), 79 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..c727775b017
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,480 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+static uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL
+		 * prohibited state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we could
+		 * not simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/*
+	 * WAL prohibit state changes are not allowed during recovery, except in
+	 * the crash recovery case.
+	 */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not yet the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared-memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Any state other than read-write is considered read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ *
+ * Depending on the final WAL prohibit state being transitioned to, the
+ * in-memory state update is done before or after emitting the global barrier.
+ *
+ * The idea is that when we say the system is WAL prohibited, WAL writes in all
+ * backends must already be prohibited, but when the system is no longer WAL
+ * prohibited, it is not necessary to take all backends out of the WAL
+ * prohibited state first.  There is no harm in letting those backends run as
+ * read-only a little longer until we emit the barrier, since they might have
+ * connected while the system was WAL prohibited and may be doing read-only
+ * work.  Backends that connect from now on can immediately start read-write
+ * operations.
+ *
+ * Therefore, when moving the system to WAL no longer being prohibited, we
+ * update the system state immediately and emit the barrier later.  But when
+ * moving the system to WAL prohibited, we emit the global barrier first, to
+ * ensure that no backend writes WAL before we set the state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the state change completes,
+	 * then the control file information will be used to set the final WAL
+	 * prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* Going out of WAL prohibited state then update state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow wal writes should be done by now  */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush() is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by all backends, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have been changed in the
+		 * meantime.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+
+		/* We are done */
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ * Increment wal prohibit counter by 1
+ */
+static uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  Checkpointer has to be
+	 * sure it has processed all pending wal prohibit state change requests as
+	 * soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes runs in non-checkpointer processes, do nothing if not
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state,
+				 * the WAL writes required by the startup process to start
+				 * the server normally were skipped; if so, do them right
+				 * away.  While doing that, hold off state transitions to
+				 * avoid a recursive call to process the WAL prohibit state
+				 * transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let Checkpointer process do anything until
+					 * someone wakes it up.  For example a backend might later
+					 * on request us to put the system back to read-write
+					 * state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e8..6c609e0e4b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8af45ac1a33..82ac0bc712a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * xlogAllowWritesState indicates the state of the last recovery checkpoint
+	 * and required wal write to start the normal server.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -981,9 +989,6 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
-static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-							 TimeLineID EndOfLogTLI);
-
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -5227,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6250,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of xlogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6611,13 +6627,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7896,7 +7929,31 @@ StartupXLOG(void)
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
-	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+	/*
+	 * Before enabling WAL insertion, initialize WAL prohibit state in shared
+	 * memory that will decide the further WAL insert should be allowed or
+	 * not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in recovery, because shutting down in the WAL
+		 * prohibited state skips the shutdown checkpoint, forcing recovery on restart.
+		 */
+		Assert(InRecovery);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+		promoted = XLogAcceptWrites(InRecovery, xlogreader, EndOfLog, EndOfLogTLI);
 
 	/*
 	 * Okay, we're officially UP.
@@ -7953,14 +8010,29 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
-static bool
-XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-				 TimeLineID EndOfLogTLI)
+bool
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
 {
 	bool		promoted = false;
 
-	/* Only Startup or standalone backend allowed to be here. */
-	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in the WAL prohibited state, then only the checkpointer
+	 * process should be here, to complete this operation that was skipped
+	 * earlier when the system booted in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
@@ -7970,7 +8042,7 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -8126,9 +8198,40 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * If we are in the checkpointer process, we need to update DBState
+	 * explicitly, like the startup process does, because the end-of-recovery
+	 * checkpoint would have set the DB state to shutdown.
+	 */
+	if (AmCheckpointerProcess())
+	{
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->state = DB_IN_PRODUCTION;
+		ControlFile->time = (pg_time_t) time(NULL);
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8343,9 +8446,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8364,9 +8467,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8388,6 +8502,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8677,9 +8797,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The restartpoint, checkpoint, or xlog rotation will be performed if the
+	 * WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8692,6 +8816,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -8941,8 +9068,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 1b2b37c1bf0..9918c39a40d 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -664,6 +664,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d516df0ac5c..ec64394e81b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -701,10 +701,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a4594..76affb7b549 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the checkpoint
+ * process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index eac68951414..77950e04600 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..01a40d805ff 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614db..46f5f82c6a7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for the
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for a WAL prohibit state change request for the checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to delay WAL prohibit change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 									path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here as well,
+				 * for the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1a8fc167733..6996dac317a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 89b5b8b7b9d..5e33d878ce9 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -726,6 +726,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0a180341c22..0596cf617a5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -233,6 +234,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -641,6 +643,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2101,6 +2104,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether WAL writes are prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12499,4 +12514,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean-type GUC.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..2fad0629202
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,58 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..3bccd8c8c1f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of the work that enables WAL writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, work not started yet */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* WAL-enabling work was skipped */
+	XLOG_ACCEPT_WRITES_DONE			/* WAL writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern bool XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL insertion is prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 26c3fc0f6ba..336b1f626ea 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11563,6 +11563,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 47accc5ffe2..624ee55ff3e 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -224,7 +224,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1196febfa25..333145da7ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2708,6 +2708,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
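
To make the state encoding described in walprohibit.h above a bit more
concrete, here is a small standalone sketch (not part of the patch; only the
enum and GetWALProhibitState() mirror the header, everything else is an
illustrative toy) showing how the shared counter only ever advances by 1 and
how its last two bits give the current state:

#include <stdint.h>
#include <stdio.h>

typedef enum
{
	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
} WALProhibitState;

static WALProhibitState
GetWALProhibitState(uint32_t wal_prohibit_counter)
{
	/* Extract the last two bits */
	return (WALProhibitState) (wal_prohibit_counter & 3);
}

int
main(void)
{
	uint32_t	counter = 0;	/* control file said read-write at startup */

	/* ALTER SYSTEM READ ONLY: 0 -> 1 (going read only) -> 2 (read only) */
	counter++;
	printf("state = %d\n", (int) GetWALProhibitState(counter));	/* prints 1 */
	counter++;
	printf("state = %d\n", (int) GetWALProhibitState(counter));	/* prints 2 */

	/* ALTER SYSTEM READ WRITE: 2 -> 3 -> 4, and 4 & 3 == 0, i.e. read-write */
	counter += 2;
	printf("state = %d\n", (int) GetWALProhibitState(counter));	/* prints 0 */

	return 0;
}

A cluster that starts read-only would simply begin with the counter at 2, as
the comment in walprohibit.h says.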

v27-0001-Refactor-separate-WAL-writing-code-from-StartupX.patch
From d9f0763b29fdfabe4c6c645e5b9be8febb6174af Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 11 May 2021 04:07:59 -0400
Subject: [PATCH v27 1/4] Refactor: separate WAL writing code from
 StartupXLOG().

Introduce a new function, XLogAcceptWrites(), and move the following
code from StartupXLOG() into it:

1. UpdateFullPageWrites(),
2. The block of code that does either CreateEndOfRecoveryRecord() or
   RequestCheckpoint() or CreateCheckPoint(),
3. The block of code that runs recovery_end_command,
4. XLogReportParameters(), and
5. CompleteCommitTsInitialization().

XLogAcceptWrites() is called from the place in StartupXLOG() where
XLogReportParameters() used to be.

The InRecovery flag is now reset after the XLogAcceptWrites() call.
Because of this, the assertion in SetMultiXactIdLimit() has to be
removed, since that function is reached via TrimMultiXact() before
InRecovery is reset.
---
 src/backend/access/transam/multixact.c |   2 -
 src/backend/access/transam/xlog.c      | 220 ++++++++++++++-----------
 2 files changed, 124 insertions(+), 98 deletions(-)
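
As a reading aid for the diff below, here is a rough standalone sketch of the
resulting call order.  Every function here is a stub standing in for the real
one; only the ordering -- the WAL-writing pieces bundled into
XLogAcceptWrites(), with InRecovery reset afterwards -- reflects the patch:

#include <stdbool.h>
#include <stdio.h>

static bool InRecovery = true;

static void UpdateFullPageWrites(void) { puts("XLOG_FPW_CHANGE record"); }
static void EndOfRecoveryOrCheckpoint(void) { puts("end-of-recovery record or checkpoint"); }
static void RunRecoveryEndCommand(void) { puts("recovery_end_command"); }
static void XLogReportParameters(void) { puts("XLOG_PARAMETER_CHANGE record, if needed"); }
static void CompleteCommitTsInitialization(void) { puts("commit_ts activation"); }

static bool
XLogAcceptWrites(void)
{
	UpdateFullPageWrites();
	EndOfRecoveryOrCheckpoint();
	RunRecoveryEndCommand();
	XLogReportParameters();
	CompleteCommitTsInitialization();
	return false;				/* the "promoted" flag in the real code */
}

int
main(void)
{
	bool		promoted = XLogAcceptWrites();

	InRecovery = false;			/* now reset only after XLogAcceptWrites() */
	printf("promoted = %d, InRecovery = %d\n", promoted, InRecovery);
	return 0;
}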

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa10..ec742f86b50 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2290,8 +2290,6 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	if (!MultiXactState->finishedStartup)
 		return;
 
-	Assert(!InRecovery);
-
 	/* Set limits for offset vacuum. */
 	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1d4415a433..8af45ac1a33 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -981,6 +981,9 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+							 TimeLineID EndOfLogTLI);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -7850,11 +7853,119 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later, in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * All done with end-of-recovery actions.
+	 *
+	 * Now allow backends to write WAL and update the control file status in
+	 * consequence.  SharedRecoveryState, that controls if backends can write
+	 * WAL, is updated while holding ControlFileLock to prevent other backends
+	 * to look at an inconsistent state of the control file in shared memory.
+	 * There is still a small window during which backends can write WAL and
+	 * the control file is still referring to a system not in DB_IN_PRODUCTION
+	 * state while looking at the on-disk control file.
+	 *
+	 * Also, we use info_lck to update SharedRecoveryState to ensure that
+	 * there are no race conditions concerning visibility of other recent
+	 * updates to shared memory.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take a longer
+	 * than is appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+static bool
+XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+				 TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only the startup process or a standalone backend is allowed here. */
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7877,15 +7988,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7916,6 +8032,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7995,57 +8113,6 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -8059,46 +8126,7 @@ StartupXLOG(void)
 	 */
 	CompleteCommitTsInitialization();
 
-	/*
-	 * All done with end-of-recovery actions.
-	 *
-	 * Now allow backends to write WAL and update the control file status in
-	 * consequence.  SharedRecoveryState, that controls if backends can write
-	 * WAL, is updated while holding ControlFileLock to prevent other backends
-	 * to look at an inconsistent state of the control file in shared memory.
-	 * There is still a small window during which backends can write WAL and
-	 * the control file is still referring to a system not in DB_IN_PRODUCTION
-	 * state while looking at the on-disk control file.
-	 *
-	 * Also, we use info_lck to update SharedRecoveryState to ensure that
-	 * there are no race conditions concerning visibility of other recent
-	 * updates to shared memory.
-	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	ControlFile->state = DB_IN_PRODUCTION;
-	ControlFile->time = (pg_time_t) time(NULL);
-
-	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
-	SpinLockRelease(&XLogCtl->info_lck);
-
-	UpdateControlFile();
-	LWLockRelease(ControlFileLock);
-
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This
-	 * isn't required for consistency, but the last restartpoint might be far
-	 * back, and in case of a crash, recovering from it might take a longer
-	 * than is appropriate now that we're not in standby mode anymore.
-	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	return promoted;
 }
 
 /*
-- 
2.18.0

v27-0003-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
From 7e6c6036123c2967d2ead33269e23acafb0abd94 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v27 3/4] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR, chosen by the following criteria, for the case
where the system is WAL-prohibited:

 - Use an ERROR in functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, add the common static
   inline function CheckWALPermitted().
 - Use an Assert in functions that cannot be reached without a valid XID;
   the Assert confirms the XID check.  For that, add
   AssertWALPermitted_HaveXID().

To enforce the rule that one of these checks appears before entering a
critical section that writes WAL, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing,
XLogBeginInsert() will fail an assertion when called inside a critical
section.

If the WAL insert is not done inside a critical section, the check above is
not necessary; we can rely on XLogBeginInsert() to check and report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++--
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 +++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 46 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 27 +++++++++++++
 40 files changed, 504 insertions(+), 68 deletions(-)
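
Before the per-file changes, here is a compile-and-run sketch of the coding
rule itself.  The macros, variables, and the error text below are placeholders
for the real PostgreSQL ones; the point is only that the permission check runs
before START_CRIT_SECTION(), never inside it:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static bool wal_prohibited = false;	/* flip to true to simulate READ ONLY */
static bool rel_needs_wal = true;	/* stands in for RelationNeedsWAL(rel) */

static void
CheckWALPermitted(void)
{
	if (wal_prohibited)
	{
		/* placeholder message, not the patch's actual wording */
		fprintf(stderr, "ERROR:  WAL is prohibited, cannot proceed\n");
		exit(1);
	}
}

/* no ereport(ERROR) is allowed between these two in the real code */
#define START_CRIT_SECTION()	do { } while (0)
#define END_CRIT_SECTION()		do { } while (0)

int
main(void)
{
	if (rel_needs_wal)
		CheckWALPermitted();	/* may ERROR out; we are not yet critical */

	START_CRIT_SECTION();
	/* ... modify shared buffers, then XLogBeginInsert()/XLogInsert() ... */
	END_CRIT_SECTION();

	puts("changes made and WAL logged");
	return 0;
}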

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c23ea44866a..75d808a460b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index bab2a88ee3f..70d846d5121 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -404,6 +406,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -415,7 +421,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -613,6 +619,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ba36da2b83c..c0b8ccd1755 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2132,6 +2133,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2455,6 +2458,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -3015,6 +3020,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3770,6 +3777,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3954,6 +3963,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4886,6 +4897,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5676,6 +5689,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5834,6 +5849,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5942,6 +5959,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6062,6 +6081,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6092,6 +6112,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6102,7 +6126,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 0c8e49d3e6c..bd2cf50f20a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL during read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 47ac6385d12..a4436c5ab4b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1301,6 +1302,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1316,8 +1322,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1928,8 +1933,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1954,7 +1964,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2387,6 +2397,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2397,6 +2408,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2427,7 +2441,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from startup process, so need not have an
+	 * XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 706e16ae949..44523f3c26c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f06..df9506b7cd5 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1130,6 +1135,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1538,6 +1545,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1624,6 +1633,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1809,6 +1820,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ec742f86b50..4d215b1c8dc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 46f3d082492..eb9174cedf9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 142da4aaff3..ad421426955 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index c727775b017..48a06709cb0 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert state used to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6c609e0e4b4..b0363e82c77 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 82ac0bc712a..171f43820fb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9326,6 +9328,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9483,6 +9488,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10135,7 +10142,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10149,10 +10156,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10174,8 +10181,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245ee..a925cc9d5cd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 833c7f5139b..8ce2eefd52e 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 		/* Can't go to r/w mode while WAL is prohibited */
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76affb7b549..ec05110b571 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0c5b87864b9..8335382870f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3904,13 +3904,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 2fad0629202..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -55,4 +55,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process never is in wal prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off when the pg_prohibit_wal() function
+ * is executed, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state
+ * to WAL prohibited.  Therefore, we need to explicitly error out before
+ * entering the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af5..f18232cbf53 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -97,12 +97,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset walpermit_checked flag when no longer in the critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -112,6 +137,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -138,6 +164,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
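
To make the pattern in the hunks above easier to follow, here is a minimal
sketch of the coding rule they apply before each WAL-writing critical section.
The function, relation, and buffer names here are placeholders for
illustration, not symbols from the patch:

	/* Sketch only: check-before-critical-section rule used in the hunks above. */
	static void
	example_wal_logged_change(Relation rel, Buffer buf)
	{
		bool		needwal = RelationNeedsWAL(rel);

		/* Check permission before entering the critical section ... */
		if (needwal)
			CheckWALPermitted();	/* raises ERROR if the system is read only */

		START_CRIT_SECTION();

		/* ... scribble on the shared buffer ... */
		MarkBufferDirty(buf);

		/* ... and reuse the cached needwal decision for the XLOG stuff. */
		if (needwal)
		{
			/* XLogBeginInsert(), XLogRegisterBuffer(), XLogInsert(), etc. */
		}

		END_CRIT_SECTION();
	}

Where the caller is guaranteed to already hold an XID (for example the
heap_insert()/heap_update()/heap_delete() and _bt_split() hunks above), the
assertion-only AssertWALPermittedHaveXID() is used instead of the runtime
check.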

#126Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#124)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, May 12, 2021 at 1:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Your idea makes sense, but IMHO, if we are first writing
XLogAcceptWrites() and then pushing out the barrier, then I don't
understand the meaning of having state #4. I mean whenever any
backend receives the barrier the system will always be in state #5.
So what do we want to do with state #4?

Well, if you don't have that, how does the checkpointer know that it's
supposed to push out the barrier?

You and Amul both seem to want to merge states #4 and #5. But how to
make that work? Basically what you are both saying is that, after we
move into the "going read-write" state, backends aren't immediately
told that they can write WAL, but have to keep checking back. But this
could be expensive. If you have one state that means that the
checkpointer has been requested to run XLogAcceptWrites() and push out
a barrier, and another state to mean that it has done so, then you
avoid that. Maybe that overhead wouldn't be large anyway, but it seems
like it's only necessary because you're trying to merge two states
which, from a logical point of view, are separate.
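
For illustration only (these names are invented here, not the patch's actual
state constants), the distinction can be sketched as:

	typedef enum
	{
		/* ... earlier states elided ... */
		WALSTATE_GOING_READ_WRITE,	/* #4: checkpointer has been asked to run
									 * XLogAcceptWrites() and emit the barrier */
		WALSTATE_READ_WRITE			/* #5: barrier emitted; backends may write
									 * WAL without re-checking shared state */
	} IllustrativeWalWriteState;

A backend that sees the second state knows the barrier has already been
emitted and does not need to check back.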

--
Robert Haas
EDB: http://www.enterprisedb.com

#127Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#126)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, May 13, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 12, 2021 at 1:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Your idea makes sense, but IMHO, if we are first writing
XLogAcceptWrites() and then pushing out the barrier, then I don't
understand the meaning of having state #4. I mean whenever any
backend receives the barrier the system will always be in state #5.
So what do we want to do with state #4?

Well, if you don't have that, how does the checkpointer know that it's
supposed to push out the barrier?

You and Amul both seem to want to merge states #4 and #5. But how to
make that work? Basically what you are both saying is that, after we
move into the "going read-write" state, backends aren't immediately
told that they can write WAL, but have to keep checking back. But this
could be expensive. If you have one state that means that the
checkpointer has been requested to run XLogAcceptWrites() and push out
a barrier, and another state to mean that it has done so, then you
avoid that. Maybe that overhead wouldn't be large anyway, but it seems
like it's only necessary because you're trying to merge two states
which, from a logical point of view, are separate.

I don't have an objection to having 5 states, just wanted to
understand your reasoning. So it makes sense to me. Thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#128Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#125)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:

Thanks for the updated patch; while going through it, I noticed this comment.

+ /*
+ * WAL prohibit state changes not allowed during recovery except the crash
+ * recovery case.
+ */
+ PreventCommandDuringRecovery("pg_prohibit_wal()");

Why do we need to allow state change during recovery? Do you still
need it after the latest changes you discussed here? I mean, now
XLogAcceptWrites() is being called before sending the barrier to
backends, so we are not afraid that a backend will write WAL before
we call XLogAcceptWrites(). So now, IMHO, we don't need to keep the
system in recovery until pg_prohibit_wal(false) is called, right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#129Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#128)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, May 13, 2021 at 12:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:

Thanks for the updated patch; while going through it, I noticed this comment.

+ /*
+ * WAL prohibit state changes not allowed during recovery except the crash
+ * recovery case.
+ */
+ PreventCommandDuringRecovery("pg_prohibit_wal()");

Why do we need to allow state change during recovery? Do you still
need it after the latest changes you discussed here? I mean, now
XLogAcceptWrites() is being called before sending the barrier to
backends, so we are not afraid that a backend will write WAL before
we call XLogAcceptWrites(). So now, IMHO, we don't need to keep the
system in recovery until pg_prohibit_wal(false) is called, right?

Your understanding is correct, and the previous patch also does the same, but
the code comment is wrong. Fixed in the attached version, also rebased for the
latest master head. Sorry for the confusion.

Regards,
Amul

Attachments:

v28-0001-Refactor-separate-WAL-writing-code-from-StartupX.patchapplication/x-patch; name=v28-0001-Refactor-separate-WAL-writing-code-from-StartupX.patchDownload
From ce47a27e06a8b76fcdbf22b39e5aae786d307318 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 11 May 2021 04:07:59 -0400
Subject: [PATCH v28 1/4] Refactor: separate WAL writing code from
 StartupXLOG().

Introduced a new function, XLogAcceptWrites(), and moved the following
code from StartupXLOG():

1. UpdateFullPageWrites(),
2. The following block of code that does either
   CreateEndOfRecoveryRecord() or RequestCheckpoint() or
   CreateCheckPoint(),
3. The next block of code that runs recovery_end_command,
4. XLogReportParameters(), and
5. CompleteCommitTsInitialization().

XLogAcceptWrites() is intended to be called from the place where
XLogReportParameters() was in StartupXLOG().

Now, the InRecovery flag will be reset after the XLogAcceptWrites() call,
and because of this the assertion in SetMultiXactIdLimit() needs to be
removed, since that function gets called via TrimMultiXact() before
InRecovery is reset.
---
 src/backend/access/transam/multixact.c |   2 -
 src/backend/access/transam/xlog.c      | 220 ++++++++++++++-----------
 2 files changed, 124 insertions(+), 98 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index daab546f296..b66d802c86e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2290,8 +2290,6 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	if (!MultiXactState->finishedStartup)
 		return;
 
-	Assert(!InRecovery);
-
 	/* Set limits for offset vacuum. */
 	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f190f3..0b9888ce59d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -981,6 +981,9 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+							 TimeLineID EndOfLogTLI);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -7852,11 +7855,119 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will be
+	 * written later in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * All done with end-of-recovery actions.
+	 *
+	 * Now allow backends to write WAL and update the control file status in
+	 * consequence.  SharedRecoveryState, that controls if backends can write
+	 * WAL, is updated while holding ControlFileLock to prevent other backends
+	 * to look at an inconsistent state of the control file in shared memory.
+	 * There is still a small window during which backends can write WAL and
+	 * the control file is still referring to a system not in DB_IN_PRODUCTION
+	 * state while looking at the on-disk control file.
+	 *
+	 * Also, we use info_lck to update SharedRecoveryState to ensure that
+	 * there are no race conditions concerning visibility of other recent
+	 * updates to shared memory.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take a longer
+	 * than is appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+static bool
+XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+				 TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only Startup or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7879,15 +7990,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7918,6 +8034,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7997,57 +8115,6 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -8061,46 +8128,7 @@ StartupXLOG(void)
 	 */
 	CompleteCommitTsInitialization();
 
-	/*
-	 * All done with end-of-recovery actions.
-	 *
-	 * Now allow backends to write WAL and update the control file status in
-	 * consequence.  SharedRecoveryState, that controls if backends can write
-	 * WAL, is updated while holding ControlFileLock to prevent other backends
-	 * to look at an inconsistent state of the control file in shared memory.
-	 * There is still a small window during which backends can write WAL and
-	 * the control file is still referring to a system not in DB_IN_PRODUCTION
-	 * state while looking at the on-disk control file.
-	 *
-	 * Also, we use info_lck to update SharedRecoveryState to ensure that
-	 * there are no race conditions concerning visibility of other recent
-	 * updates to shared memory.
-	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	ControlFile->state = DB_IN_PRODUCTION;
-	ControlFile->time = (pg_time_t) time(NULL);
-
-	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
-	SpinLockRelease(&XLogCtl->info_lck);
-
-	UpdateControlFile();
-	LWLockRelease(ControlFileLock);
-
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
-	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	return promoted;
 }
 
 /*
-- 
2.18.0

v28-0004-Documentation.patchapplication/x-patch; name=v28-0004-Documentation.patchDownload
From 5ce8133157f0a34d9b47753112e9e5290c344524 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v28 4/4] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d1f1794ca3..7e5e12c5aed 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24882,9 +24882,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25018,6 +25018,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         is emitted and <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to read-only
+        (the WAL prohibited state), if it is not in that state already. When
+        <literal>false</literal> is passed, the system is changed to read-write
+        (the WAL permitted state), if it is not in that state already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..d761c7c1cad 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a read-only mode in which inserting write-ahead log is prohibited until
+    the same function is executed to change the state back to read-write. As
+    in Hot Standby, connections to the server may still run read-only queries
+    while in the WAL prohibited state.  If the system is in the WAL prohibited
+    state, the GUC <literal>wal_prohibited</literal> reports
+    <literal>on</literal>; otherwise it reports <literal>off</literal>.  When
+    the WAL prohibited state is requested, any session that is already running
+    a transaction which has performed, or may still perform, WAL write
+    operations is terminated. This is useful in an HA setup where the master
+    server needs to stop accepting WAL writes immediately and kick out any
+    transaction expecting to write WAL at commit, for example when the network
+    to the master goes down or the replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a read-only system skips the shutdown checkpoint, so at the
+    next start the server enters crash recovery and stays in that state until
+    the system is changed to read-write. If a starting read-only server finds
+    a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file, the system implicitly leaves
+    the read-only state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..74c965b1f19 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+Read only system state
+----------------------
+
+The system is in a read-only state when it is not currently possible to insert
+write-ahead log records, either because it is still in recovery or because it
+was forced into the WAL prohibited state by the pg_prohibit_wal() function.  We
+have a lower-level defense in XLogBeginInsert() and elsewhere that stops us from
+modifying data when !XLogInsertAllowed(), but if XLogBeginInsert() is called
+inside a critical section we must not depend on it to report an error, because
+any error there escalates to PANIC, as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend that receives the read-only state transition barrier interrupt must
+stop writing WAL immediately.  To absorb the barrier, a backend kills its
+running transaction if it has a valid XID, since a valid XID indicates that the
+transaction has performed, or plans to perform, WAL writes.  Transactions that
+have not acquired an XID yet, and operations such as VACUUM or concurrent
+CREATE INDEX that do not necessarily need an XID to write WAL, are not stopped
+during barrier processing; they may instead hit an error from XLogBeginInsert()
+when they try to write WAL in the read-only state.  To prevent such an error
+from being raised inside a critical section, WAL write permission has to be
+checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that records whether
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the permission-check flag, one of
+CheckWALPermitted(), AssertWALPermitted_HaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is reset automatically
+when the critical section is exited.  The rules for choosing among the
+permission check routines are:
+
+	Places where a WAL write in the critical section can happen without a
+	valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error is reported before the critical section is entered.
+
+	Places such as INSERT and UPDATE, which never happen without a valid XID,
+	can be checked with AssertWALPermitted_HaveXID(), so that non-assert
+	builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the read-only state, and that may
+	or may not have an XID, but still need the permission check enforced in
+	assert-enabled builds, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read-only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those cases.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
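
To make the coding rule easier to see before reading the 0003 patch that
follows, here is a condensed sketch of the call-site pattern it applies over
and over (compare the heap_surgery.c and vacuumlazy.c hunks): cache
RelationNeedsWAL() once, check WAL permission before entering the critical
section, and reuse the cached value for the XLOG step.  The function
my_page_update() and its choice of log_newpage_buffer() as the WAL record are
illustrative placeholders, not code taken from the patch.

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Hypothetical call site following the rule documented in the README above */
static void
my_page_update(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Check WAL write permission while raising an ERROR is still acceptable,
	 * i.e. before START_CRIT_SECTION().  This also sets the assert-only flag
	 * that XLogBeginInsert() verifies inside the critical section.
	 */
	if (needwal)
		CheckWALPermitted();

	/* No ereport(ERROR) from here until the changes are logged */
	START_CRIT_SECTION();

	/* ... scribble on the page protected by buf ... */
	MarkBufferDirty(buf);

	/* XLOG stuff, reusing the cached needwal rather than re-evaluating it */
	if (needwal)
		log_newpage_buffer(buf, true);

	END_CRIT_SECTION();
}

In paths that can only be reached while holding a valid XID (heap_insert,
_bt_split, and so on), the patch uses AssertWALPermittedHaveXID() in place of
CheckWALPermitted(), so non-assert builds pay no extra cost there.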

v28-0003-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 91f4927a36e2cca92b12e5af302476829675cdee Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v28 3/4] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before START_CRIT_SECTION for WAL writes, based on
the following criteria, for when the system is WAL prohibited:

 - An ERROR for functions that can be reached without a valid XID, e.g. in
   VACUUM or CONCURRENT CREATE INDEX.  For that, the common static inline
   function CheckWALPermitted() is added.
 - An Assert for functions that cannot be reached without a valid XID; the
   Assert also validates the XID.  For that, AssertWALPermitted_HaveXID() is
   added.

To enforce the rule that one of these checks precedes a critical section that
writes WAL, a new assert-only flag, walpermit_checked_state, is added.  If the
check is missing, XLogBeginInsert() fails an assertion when it is called
inside a critical section.

If the WAL insert is not inside a critical section, the check above is not
necessary; we can rely on XLogBeginInsert() itself to perform the check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 12 ++++--
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 13 ++++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/commands/variable.c           |  9 +++--
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 46 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 27 +++++++++++++
 40 files changed, 504 insertions(+), 68 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..8c672770e79 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ac07f2fdac..592eecf4409 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2132,6 +2133,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2455,6 +2458,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -3015,6 +3020,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3773,6 +3780,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3957,6 +3966,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4889,6 +4900,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5679,6 +5692,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5837,6 +5852,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5945,6 +5962,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6065,6 +6084,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6095,6 +6115,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6105,7 +6129,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..fd03ec0c65e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * We can't write WAL in read-only mode, so there's no point trying to
 	 * clean the page. The primary will likely issue a cleaning WAL record
 	 * soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 17519a970fe..1414bab3ab8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1301,6 +1302,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1316,8 +1322,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1928,8 +1933,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1954,7 +1964,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2387,6 +2397,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2397,6 +2408,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2427,7 +2441,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * Can reach here from VACUUM or from the startup process, so we need not
+	 * have an XID.
+	 *
+	 * Recovery in the startup process never runs in the WAL prohibited state,
+	 * so only assert permission when we reach here from the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 854e3b2cf9a..71f5ade8c28 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 4d380c99f06..df9506b7cd5 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1130,6 +1135,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1538,6 +1545,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1624,6 +1633,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1809,6 +1820,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66d802c86e..b40af2937d9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813c564..ddbdda8b79d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a22bf375f85..396de62de64 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id in read-only mode */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index d1bbecd1680..ada047fda35 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6c609e0e4b4..b0363e82c77 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bac21ee294b..3844d9a8f60 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9328,6 +9330,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9485,6 +9490,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10137,7 +10144,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10151,10 +10158,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10176,8 +10183,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 32b4cc84e79..8e9bfb6e9da 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +216,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPERMIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 833c7f5139b..8ce2eefd52e 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -501,11 +501,14 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("transaction read-write mode must be set before any query");
 			return false;
 		}
-		/* Can't go to r/w mode while recovery is still active */
-		if (RecoveryInProgress())
+		/*
+		 * Can't go to r/w mode while recovery is still active or while in WAL
+		 * prohibit state
+		 */
+		if (!XLogInsertAllowed())
 		{
 			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+			GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
 			return false;
 		}
 		/* Can't go to r/w mode while WAL is prohibited */
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c0d805313dc..095aff632aa 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* A checkpoint is allowed during recovery but not in the WAL prohibited state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a22c45..57d05e094c6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3904,13 +3904,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 2fad0629202..807fbd45273 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -55,4 +55,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("system is now read only")));
+
+#ifdef USE_ASSERT_CHECKING
+	walpermit_checked_state = WALPERMIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95202d37af5..f18232cbf53 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -97,12 +97,37 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPERMIT_UNCHECKED,
+	WALPERMIT_CHECKED,
+	WALPERMIT_CHECKED_AND_USED
+} WALPermitCheckState;
+
+/* in access/walprohibit.c */
+extern WALPermitCheckState walpermit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section;
+ * otherwise, mark it as checked and used.
+ */
+#define RESET_WALPERMIT_CHECKED_STATE() \
+do { \
+	walpermit_checked_state = CritSectionCount ? \
+	WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
+} while(0)
+#else
+#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
+#endif
+
 #ifndef WIN32
 
 #define CHECK_FOR_INTERRUPTS() \
 do { \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #else							/* WIN32 */
 
@@ -112,6 +137,7 @@ do { \
 		pgwin32_dispatch_queued_signals(); \
 	if (unlikely(InterruptPending)) \
 		ProcessInterrupts(); \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 #endif							/* WIN32 */
 
@@ -138,6 +164,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPERMIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
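
For illustration only (not part of the patch): a minimal sketch of the coding
rule the hunks above establish, written as a hypothetical caller.  The
permission checks, critical-section macros, and WAL-insert calls are the real
APIs; the enclosing function and the RM_FOO_ID/XLOG_FOO_OP record are made up.

    static void
    illustrate_wal_write(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /* Check WAL permission before entering the critical section */
        if (needwal)
            CheckWALPermitted();            /* ERRORs out if WAL is prohibited */

        START_CRIT_SECTION();

        /* ... modify the page ... */
        MarkBufferDirty(buf);

        if (needwal)
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();              /* asserts the check was done */
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);    /* hypothetical record */
            PageSetLSN(BufferGetPage(buf), recptr);
        }

        END_CRIT_SECTION();                 /* resets walpermit_checked_state */
    }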

Attachment: v28-0002-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 1ccf353e402e50025cf39cd14c0e299672840e55 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v28 2/4] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user requests a change of the server state to WAL-prohibited by
    calling the pg_prohibit_wal(true) SQL function, the current state is
    marked as in-progress in shared memory and the checkpointer process is
    signaled.  The checkpointer, noticing the state transition, emits the
    barrier request and then acknowledges back to the backend that requested
    the state change once the transition has completed.  The final state is
    recorded in the control file to make it persistent across system
    restarts.

 2. When a backend receives the WAL-prohibited barrier while it is already in
    a transaction that has an assigned XID, the backend is killed by throwing
    FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    we don't need to do anything special right now; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which reflects the WAL prohibited state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-prohibited server state until someone wakes them up, e.g. a
    backend might later request that the system be put back into the state
    where WAL is no longer prohibited.

 6. At shutdown in WAL-prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery,
    but the end-of-recovery checkpoint and the WAL writes necessary to start
    the server normally are skipped; they are performed once the system is
    changed so that WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is restricted on a standby server.

 8. The presence of a standby.signal and/or recovery.signal file will
    implicitly and permanently pull the server out of the WAL prohibited
    state.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on"
    when the system is WAL prohibited.
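
For orientation only (an editorial illustration, not part of the patch): the
low two bits of wal_prohibit_counter encode the state, exactly as
GetWALProhibitState() does with "counter & 3".  The numeric values below are
inferred from the counter arithmetic in this patch; each increment advances
the state machine one step, the requester and the checkpointer each
contributing one increment per transition.

    #include <stdint.h>
    #include <stdio.h>

    /* Inferred mapping of the low two bits of wal_prohibit_counter */
    typedef enum
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,       /* WAL writes allowed */
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,  /* transition to read-only requested */
        WALPROHIBIT_STATE_READ_ONLY = 2,        /* WAL prohibited */
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3  /* transition back requested */
    } WALProhibitState;

    static WALProhibitState
    state_of(uint32_t counter)
    {
        return (WALProhibitState) (counter & 3);    /* as in GetWALProhibitState() */
    }

    int
    main(void)
    {
        /* The backend and the checkpointer each bump the counter once per transition. */
        for (uint32_t c = 0; c <= 4; c++)
            printf("counter=%u decodes to state %d\n", (unsigned) c, (int) state_of(c));
        return 0;
    }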
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 477 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 182 +++++++--
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  58 +++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 852 insertions(+), 79 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..d1bbecd1680
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,477 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Indicates current WAL prohibit state counter and the last two bits of
+	 * this counter indicates current wal prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+static uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should only be here while transitioning towards the WAL
+		 * prohibited state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons (which still need some thought):
+		 *
+		 * 1. Because of challenges with the wire protocol, we cannot simply
+		 * abort an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again later.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again later.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter and the control
+	 * file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Other than read-write state will be considered as read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ *
+ * Depending on the final state being transitioned to, the in-memory state is
+ * updated either before or after emitting the global barrier.
+ *
+ * The idea behind this is that once we say the system is WAL prohibited, WAL
+ * writes in all backends must be prohibited; but when the system is no longer
+ * WAL prohibited, it is not necessary to pull every backend out of the WAL
+ * prohibited state at once.  There is no harm in letting those backends run
+ * read-only a little longer until we emit the barrier, since they may have
+ * connected while the system was WAL prohibited and may be doing read-only
+ * work anyway.  Backends that connect from now on can immediately start
+ * read-write operations.
+ *
+ * Therefore, when moving the system to the state where WAL is no longer
+ * prohibited, we update the shared state immediately and emit the barrier
+ * later.  But when moving the system to WAL prohibited, we emit the global
+ * barrier first, to ensure that no backend writes WAL before we set the
+ * system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to
+	 * be completed.  If the server crashes before the transition completes,
+	 * the control file information will be used to set the final WAL
+	 * prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* When leaving the WAL prohibited state, update the state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow wal writes should be done by now  */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right
+	 * away, since XLogFlush is not restricted in the WAL prohibited state
+	 * either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been absorbed by all backends, which ensures that every
+	 * backend is in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the
+		 * shared WAL prohibit state counter should not have changed in the
+		 * meantime.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+
+		/* We are done */
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ * Increment wal prohibit counter by 1
+ */
+static uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  Checkpointer has to be
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * make sure it processes all pending WAL prohibit state change requests
+	 * as soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if we are not
+	 * the checkpointer.
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server is started in wal prohibited state then the
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process would normally perform
+				 * to bring the server up have been skipped; do them now.
+				 * While doing so, hold off state transitions to avoid a
+				 * recursive attempt to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request that the system be put back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
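
For illustration only (a standalone POSIX-threads analogue, not PostgreSQL
code): the wait protocol pg_prohibit_wal() uses above boils down to sleeping
on a condition variable until a shared counter, advanced by the checkpointer,
reaches the requester's target value.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  state_changed = PTHREAD_COND_INITIALIZER;
    static uint32_t        counter = 0;    /* stands in for wal_prohibit_counter */

    /* "Checkpointer" side: advance the counter and wake all waiters. */
    static void
    advance_counter(void)
    {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_cond_broadcast(&state_changed);   /* ~ ConditionVariableBroadcast() */
        pthread_mutex_unlock(&lock);
    }

    /* "Backend" side: block until the counter reaches the target value. */
    static void
    wait_for_counter(uint32_t target)
    {
        pthread_mutex_lock(&lock);
        while (counter < target)                  /* ~ ConditionVariableSleep() loop */
            pthread_cond_wait(&state_changed, &lock);
        pthread_mutex_unlock(&lock);
    }

    int
    main(void)
    {
        advance_counter();     /* requester's own increment (request the transition) */
        advance_counter();     /* completion increment (normally the checkpointer's) */
        wait_for_counter(2);   /* returns as soon as counter >= 2 */
        printf("state transition observed, counter=%u\n", (unsigned) counter);
        return 0;
    }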
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e8..6c609e0e4b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() has been
+	 * executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0b9888ce59d..bac21ee294b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally
+	 * have been performed.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -981,9 +989,6 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
-static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-							 TimeLineID EndOfLogTLI);
-
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -5227,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6250,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return the value of SharedXLogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6615,13 +6631,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7898,7 +7931,31 @@ StartupXLOG(void)
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
-	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed
+	 * or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We always start in recovery here, since a shutdown in the WAL
+		 * prohibited state skips the shutdown checkpoint, which forces crash
+		 * recovery on restart.
+		 */
+		Assert(InRecovery);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+		promoted = XLogAcceptWrites(InRecovery, xlogreader, EndOfLog, EndOfLogTLI);
 
 	/*
 	 * Okay, we're officially UP.
@@ -7955,14 +8012,29 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
-static bool
-XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-				 TimeLineID EndOfLogTLI)
+bool
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
 {
 	bool		promoted = false;
 
-	/* Only Startup or standalone backend allowed to be here. */
-	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in the WAL prohibited state, only the checkpointer
+	 * process should be here, completing the operation that was skipped
+	 * earlier while booting the system in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
@@ -7972,7 +8044,7 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -8128,9 +8200,40 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * If we are in the checkpointer process, we need to update DBState
+	 * explicitly, as the startup process does, because the end-of-recovery
+	 * checkpoint would have set the DB state to shutdown.
+	 */
+	if (AmCheckpointerProcess())
+	{
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->state = DB_IN_PRODUCTION;
+		ControlFile->time = (pg_time_t) time(NULL);
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8345,9 +8448,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8366,9 +8469,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8390,6 +8504,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8679,9 +8799,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * During recovery, perform a shutdown restartpoint; otherwise, perform
+	 * the shutdown checkpoint and xlog rotation only if WAL writing is
+	 * permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8694,6 +8818,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -8943,8 +9070,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
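
For illustration only (a standalone sketch, not the patch code): after this
change, the per-backend cache that XLogInsertAllowed() keeps is a tri-state
value: -1 means "consult the shared state", 0 and 1 are cached answers, and
entering the WAL prohibited state pins it at 0 until
ResetLocalXLogInsertAllowed() puts it back to -1.  The two predicate stubs
below are placeholders for the real shared-state checks.

    #include <stdbool.h>
    #include <stdio.h>

    static int local_allowed = -1;          /* mirrors LocalXLogInsertAllowed */

    /* Placeholder stubs for RecoveryInProgress() and IsWALProhibited() */
    static bool in_recovery(void)    { return false; }
    static bool wal_prohibited(void) { return true; }

    static bool
    xlog_insert_allowed(void)
    {
        if (local_allowed >= 0)
            return (bool) local_allowed;    /* cached answer, no shared lookup */

        if (in_recovery())
            return false;                   /* stays -1: keep checking */

        if (wal_prohibited())
        {
            local_allowed = 0;              /* "unconditionally false" until reset */
            return false;
        }

        local_allowed = 1;                  /* "unconditionally true" */
        return true;
    }

    static void
    reset_local(void)                       /* ~ ResetLocalXLogInsertAllowed() */
    {
        local_allowed = -1;
    }

    int
    main(void)
    {
        printf("allowed=%d, cached=%d\n", (int) xlog_insert_allowed(), local_allowed);
        reset_local();
        printf("after reset, cached=%d\n", local_allowed);
        return 0;
    }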
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a4373b176c6..dbc295730e9 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -704,6 +704,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d516df0ac5c..ec64394e81b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -701,10 +701,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a01..c0d805313dc 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..01a40d805ff 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index bc3ceb27125..417662d28a1 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to stop wal prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * Don't want to hold up WAL prohibit state change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reason mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1a8fc167733..6996dac317a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 6baf67740c7..5a21fbbbbfa 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -726,6 +726,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eb7f7181e43..dc95897fb56 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -233,6 +234,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -641,6 +643,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2101,6 +2104,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether WAL writes are prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12500,4 +12515,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..2fad0629202
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,58 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
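
To make the counter encoding described in the header comment above concrete,
here is a small standalone sketch (not part of the patch; the state names are
taken from the enum above) showing how successive increments of the shared
counter cycle through the four states via the low two bits, exactly as
GetWALProhibitState() extracts them:

#include <stdio.h>

int
main(void)
{
	/* Names match the WALProhibitState enum in walprohibit.h above */
	const char *names[] = {
		"WALPROHIBIT_STATE_READ_WRITE",
		"WALPROHIBIT_STATE_GOING_READ_ONLY",
		"WALPROHIBIT_STATE_READ_ONLY",
		"WALPROHIBIT_STATE_GOING_READ_WRITE"
	};
	unsigned int counter;

	/* The counter only ever increases by one; the low two bits give the state. */
	for (counter = 0; counter <= 6; counter++)
		printf("counter = %u -> %s\n", counter, names[counter & 3]);

	return 0;
}

A cluster that starts out read-only begins at counter value 2, matching the
header comment's note that the initial value at postmaster startup is either
0 or 2.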
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..3bccd8c8c1f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of work that enables wal writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern bool XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL insertion is prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index acbcae46070..c2298c8c45b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11578,6 +11578,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6c6ec2e7118..e95b3197cc5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -224,7 +224,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb08319ca..b8f2e22d7e6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2816,6 +2816,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

#130Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#129)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, May 13, 2021 at 2:54 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, May 13, 2021 at 12:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:

Thanks for the updated patch; while going through it, I noticed this comment.

+ /*
+ * WAL prohibit state changes not allowed during recovery except the crash
+ * recovery case.
+ */
+ PreventCommandDuringRecovery("pg_prohibit_wal()");

Why do we need to allow a state change during recovery? Do you still
need it after the latest changes you discussed here, now that
XLogAcceptWrites() is called before sending the barrier to backends?
So we are no longer afraid that a backend will write WAL before we
call XLogAcceptWrites(). So IMHO we don't need to keep the system in
recovery until pg_prohibit_wal(false) is called, right?

Your understanding is correct, and the previous patch also does the same, but
the code comment was wrong. That is fixed in the attached version, which is also
rebased on the latest master head. Sorry for the confusion.

Great thanks. I will review the remaining patch soon.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#131Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#130)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Great thanks. I will review the remaining patch soon.

I have reviewed v28-0003, and I have some comments on this.

===
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
Assert(mainrdata_len == 0);

+    /*
+     * WAL permission must have been checked before entering the critical
+     * section.  Otherwise, a WAL prohibited error would force a system panic.
+     */
+    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
     /* cross-check on whether we should be here or not */
-    if (!XLogInsertAllowed())
-        elog(ERROR, "cannot make new WAL entries during recovery");
+    CheckWALPermitted();

We must not call CheckWALPermitted inside the critical section; if we
are here, we must already be sure that WAL is permitted, so it would be
better to put an assert. Even if that is ensured by some other means, I
don't see any reason to call this error-generating function.

===

+CheckWALPermitted(void)
+{
+    if (!XLogInsertAllowed())
+        ereport(ERROR,
+                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+                 errmsg("system is now read only")));
+

system is now read only -> wal is prohibited (in error message)

===

-     * We can't write WAL in recovery mode, so there's no point trying to
+     * We can't write WAL during read-only mode, so there's no point trying to

during read-only mode -> if WAL is prohibited or WAL recovery in
progress (add recovery in progress and also modify read-only to wal
prohibited)

===

+        if (!XLogInsertAllowed())
         {
             GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-            GUC_check_errmsg("cannot set transaction read-write mode
during recovery");
+            GUC_check_errmsg("cannot set transaction read-write mode
while system is read only");
             return false;
         }

system is read only -> WAL is prohibited

===

I think that's all I have to say about 0003.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#132Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#131)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Great thanks. I will review the remaining patch soon.

I have reviewed v28-0003, and I have some comments on this.

===
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
Assert(mainrdata_len == 0);

+    /*
+     * WAL permission must have been checked before entering the critical
+     * section.  Otherwise, a WAL prohibited error would force a system panic.
+     */
+    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
/* cross-check on whether we should be here or not */
-    if (!XLogInsertAllowed())
-        elog(ERROR, "cannot make new WAL entries during recovery");
+    CheckWALPermitted();

We must not call CheckWALPermitted inside the critical section; if we
are here, we must already be sure that WAL is permitted, so it would be
better to put an assert. Even if that is ensured by some other means, I
don't see any reason to call this error-generating function.

I understand that we should not raise an error inside a critical section, but
this check is not wrong. The patch has enough checks that an error due to the
WAL prohibited state cannot be hit inside a critical section; see the assert
just before CheckWALPermitted(). Before entering the critical section, we do
have an explicit WAL prohibited check, and to make sure that check has been
done for every current critical section that writes WAL, we have the aforesaid
assertion. For more detail, please have a look at the "WAL prohibited system
state" section of src/backend/access/transam/README added in the 0004 patch.
This assertion also ensures that future development does not miss the WAL
prohibited state check before entering a newly added critical section for WAL
writes.
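
For reference, the call-site pattern being discussed looks roughly like this.
It is a sketch only: the surrounding function and the record type are invented
for illustration, and only the WAL-permit helpers (CheckWALPermitted(),
AssertWALPermittedHaveXID()) come from the v29 patches; everything else is
existing PostgreSQL API.

#include "postgres.h"

#include "access/rmgr.h"
#include "access/walprohibit.h"		/* WAL-permit helpers added by this patch set */
#include "access/xlog_internal.h"	/* XLOG_FPI, used only as a placeholder here */
#include "access/xloginsert.h"
#include "miscadmin.h"				/* START_CRIT_SECTION / END_CRIT_SECTION */
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Hypothetical call site illustrating the coding rule from patch 0003. */
static void
example_modify_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * The WAL-permit check must happen before START_CRIT_SECTION(): an
	 * ERROR raised inside a critical section escalates to PANIC.  Paths
	 * that can run without a valid XID (e.g. VACUUM) use the ERROR
	 * variant; paths that always hold an XID assert instead.
	 */
	if (needwal)
		CheckWALPermitted();	/* or AssertWALPermittedHaveXID() */

	START_CRIT_SECTION();

	/* ... scribble on the shared buffer ... */
	MarkBufferDirty(buf);

	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();		/* asserts that the check above was done */
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();
}

Computing needwal once before the critical section lets the same condition
guard both the permit check and the later XLogInsert() call, which is the
refactoring the 0003 diffs apply at each call site.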

===

+CheckWALPermitted(void)
+{
+    if (!XLogInsertAllowed())
+        ereport(ERROR,
+                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+                 errmsg("system is now read only")));
+

system is now read only -> wal is prohibited (in error message)

===

-     * We can't write WAL in recovery mode, so there's no point trying to
+     * We can't write WAL during read-only mode, so there's no point trying to

during read-only mode -> if WAL is prohibited or WAL recovery in
progress (add recovery in progress and also modify read-only to wal
prohibited)

===

+        if (!XLogInsertAllowed())
{
GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-            GUC_check_errmsg("cannot set transaction read-write mode
during recovery");
+            GUC_check_errmsg("cannot set transaction read-write mode
while system is read only");
return false;
}

system is read only -> WAL is prohibited

===

Fixed all in the attached version.

I think that's all I have to say about 0003.

Thanks for the review.

Regards,
Amul

Attachments:

v29-0003-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From bdbb05be362a3400751fdb0991c91e5313181b78 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v29 3/4] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before START_CRIT_SECTION for WAL writes when the
system is WAL-prohibited, based on the following criteria:

 - Use an ERROR for functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Use an Assert for functions that cannot be reached without a valid XID; the
   assert also verifies XID validity.  For that, AssertWALPermitted_HaveXID()
   is added.

To enforce the rule that one of these checks precedes every critical section
that writes WAL, a new assert-only flag walpermit_checked_state is added.  If
the check is missing, XLogBeginInsert() will fail an assertion when called
inside a critical section.

If the WAL insert is not done inside a critical section, the above checking is
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 +++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 14 ++++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 46 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 +++++++++++++
 39 files changed, 500 insertions(+), 67 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..8c672770e79 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ac07f2fdac..592eecf4409 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2132,6 +2133,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2455,6 +2458,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -3015,6 +3020,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3773,6 +3780,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3957,6 +3966,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4889,6 +4900,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5679,6 +5692,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5837,6 +5852,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5945,6 +5962,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -6065,6 +6084,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6095,6 +6115,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6105,7 +6129,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..0cb9adf8b5d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write WAL when WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f1f8e340d9..2c9d720149b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1304,6 +1305,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1319,8 +1325,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1931,8 +1936,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1957,7 +1967,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2390,6 +2400,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2400,6 +2411,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2430,7 +2444,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in a wal prohibit state, so skip
+	 * the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 271994b08df..99466b5a5a9 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66d802c86e..b40af2937d9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813c564..ddbdda8b79d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a22bf375f85..fce9ad78afa 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index d1bbecd1680..ac843009213 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -26,6 +26,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the WAL insert permission check rule before starting
+ * a critical section for WAL writes.  For this, one of CheckWALPermitted(),
+ * AssertWALPermittedHaveXID(), or AssertWALPermitted() must be called before
+ * starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6c609e0e4b4..b0363e82c77 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bac21ee294b..3844d9a8f60 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1037,7 +1037,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise dirty buffers
+	 * could not be evicted until WAL was flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9328,6 +9330,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9485,6 +9490,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10137,7 +10144,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10151,10 +10158,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10176,8 +10183,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Ensure WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 32b4cc84e79..6b669f32c02 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,15 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +217,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c0d805313dc..095aff632aa 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a22c45..57d05e094c6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3904,13 +3904,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL inserts are currently disallowed,
+			 * don't dirty the page.  We can set the hint, but must not dirty
+			 * the page as a result, lest we trigger WAL generation. Unless
+			 * the page is dirtied again later, the hint will be lost when the
+			 * page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ffe98322ec5..8b4bc3cfec1 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -55,4 +55,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in a wal prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 4dc343cbc59..c708ecdc743 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset the walprohibit_checked_state flag when no longer in a critical
+ * section.  Otherwise, mark it as checked-and-used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

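A quick note on exercising the coding rule added above: once the system is in
the WAL prohibit state, an operation that writes WAL without ever acquiring an
XID (VACUUM, for example, per the pruneheap.c and vacuumlazy.c hunks) is now
rejected by CheckWALPermitted() with an ERROR before it enters a critical
section, rather than failing inside one.  A minimal sketch, assuming the whole
patch set is applied; pg_prohibit_wal() is added by the 0002 patch below and
"t" is just a placeholder table with some dead rows:

    SELECT pg_prohibit_wal(true);   -- put the system into the WAL prohibit state
    VACUUM t;                       -- expected: ERROR:  WAL is now prohibited
    SELECT pg_prohibit_wal(false);  -- allow WAL writes again
    VACUUM t;                       -- proceeds normally
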
Attachment: v29-0002-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From b0061fe532ad427156ea202c20b048af0a6f82db Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v29 2/4] Implement wal prohibit state using global barrier.

Implementation:

 1. When a user tries to change the server state to WAL-Prohibited by calling
    the pg_prohibit_wal(true) SQL function, the current state is marked as
    in progress in shared memory and the checkpointer process is signaled.
    The checkpointer, noticing the state transition, emits the barrier
    request and then acknowledges back to the backend that requested the
    state change once the transition has been completed.  The final state is
    updated in the control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction and that transaction has already been assigned an XID, then
    the backend will be killed by throwing FATAL (XXX: needs more discussion
    on this).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which sets the local WAL prohibited state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up.  E.g. a
    backend might later request putting the system back into a state where
    WAL is no longer prohibited.

 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery, but
    the end-of-recovery checkpoint and the WAL writes necessary to start the
    server normally will be skipped; they will be performed once the system
    is changed so that WAL is no longer prohibited.

 7. Altering the WAL-Prohibited mode is restricted on a standby server.

 8. The presence of a recovery.signal and/or standby.signal file will
    implicitly and permanently take the server out of the WAL prohibited
    state.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on"
    when the system is WAL prohibited.
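
As an illustrative round trip through the states above (a sketch only;
pg_prohibit_wal() and the wal_prohibited GUC are the ones introduced by this
patch, and "t" is a placeholder table):

    SELECT pg_prohibit_wal(true);   -- waits until the checkpointer completes
                                    -- the transition to WAL-Prohibited
    SHOW wal_prohibited;            -- "on"
    INSERT INTO t VALUES (1);       -- expected to fail: new transactions start
                                    -- as read-only (see point 4)
    SELECT pg_prohibit_wal(false);  -- request read-write again
    SHOW wal_prohibited;            -- "off"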
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 477 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 182 +++++++--
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  58 +++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 852 insertions(+), 79 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..d1bbecd1680
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,477 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current wal prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static uint32 GetWALProhibitCounter(void);
+static uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ * Handle WAL prohibit state change request.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transitioning toward the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which need more thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we could not simply
+		 * kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ * SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state since we have yet to convey this WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared memory wal prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the wal prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Any state other than read-write is considered read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ * Complete the requested WAL prohibit state transition.
+ *
+ * Depending on the final WAL prohibit state being transitioned to, the
+ * in-memory state update is done before or after emitting the global barrier.
+ *
+ * The idea is that when we say the system is WAL prohibited, WAL writes in
+ * all backends should be prohibited, but when the system is no longer WAL
+ * prohibited, it is not necessary to take all backends out of the WAL
+ * prohibited state at once.  There is no harm in letting those backends run
+ * as read-only a little longer until we emit the barrier, since they might
+ * have connected while the system was WAL prohibited and might be doing
+ * read-only work; backends that connect from now on can immediately start
+ * read-write operations.
+ *
+ * Therefore, when moving to a state where WAL is no longer prohibited, we
+ * update the system state immediately and emit the barrier later.  But when
+ * moving to WAL prohibited, we emit the global barrier first to ensure that
+ * no backend writes WAL before we set the system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter value */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called from the checkpointer, or else from a single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a wal prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the state change completes,
+	 * the control file information will be used to set the final wal prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* If going out of the WAL prohibited state, update the state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow wal writes should have been done by now */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * A WAL prohibit state change has been initiated.  We need to complete the
+	 * state transition by setting the requested state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the wal prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the wal prohibit state counter in shared memory once the
+	 * barrier has been processed by every backend, which ensures that all
+	 * backends are in the wal prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * wal prohibit state counter shouldn't have been changed in the
+		 * meantime.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+
+		/* We are done */
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ * Increment wal prohibit counter by 1
+ */
+static uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ *
+ * Checkpointer will complete wal prohibit state change request.
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  Checkpointer has to be
+	 * sure it has processed all pending wal prohibit state change requests as
+	 * soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if not
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the wal prohibited state, the
+				 * wal write operations that the startup process needs to
+				 * perform to start the server normally were skipped; if so,
+				 * do them right away.  While doing that, hold off the state
+				 * transition to avoid a recursive call to process wal
+				 * prohibit state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ *
+ * Atomically return the current server WAL prohibited state counter.
+ */
+static uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ * Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e8..6c609e0e4b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because the pg_prohibit_wal()
+	 * function has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0b9888ce59d..bac21ee294b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -248,9 +249,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -732,6 +734,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState tracks whether the end-of-recovery checkpoint
+	 * and the other wal writes required to start the server normally have been
+	 * performed, skipped, or are still pending.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -981,9 +989,6 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
-static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-							 TimeLineID EndOfLogTLI);
-
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -5227,6 +5232,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6250,6 +6256,16 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Return value of xlogAllowWritesState.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Read the latest value */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6615,13 +6631,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7898,7 +7931,31 @@ StartupXLOG(void)
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
-	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory; it determines whether further WAL inserts are allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We must have started in recovery: shutting down in the wal
+		 * prohibited state skips the shutdown checkpoint, which forces crash
+		 * recovery on restart.
+		 */
+		Assert(InRecovery);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+		promoted = XLogAcceptWrites(InRecovery, xlogreader, EndOfLog, EndOfLogTLI);
 
 	/*
 	 * Okay, we're officially UP.
@@ -7955,14 +8012,29 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
-static bool
-XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-				 TimeLineID EndOfLogTLI)
+bool
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
 {
 	bool		promoted = false;
 
-	/* Only Startup or standalone backend allowed to be here. */
-	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the wal writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in the wal prohibited state, only the checkpointer
+	 * process should reach here, to complete the operations that were skipped
+	 * while booting the system in that state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
@@ -7972,7 +8044,7 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -8128,9 +8200,40 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * If we are in the checkpointer process, we need to update DBState
+	 * explicitly, as the startup process does, because the end-of-recovery
+	 * checkpoint would have set the db state to shutdown.
+	 */
+	if (AmCheckpointerProcess())
+	{
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		ControlFile->state = DB_IN_PRODUCTION;
+		ControlFile->time = (pg_time_t) time(NULL);
+		UpdateControlFile();
+		LWLockRelease(ControlFileLock);
+	}
+
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8345,9 +8448,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8366,9 +8469,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8390,6 +8504,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8679,9 +8799,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A shutdown checkpoint or xlog rotation is performed only if WAL writing
+	 * is permitted; during recovery a restartpoint is created instead.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8694,6 +8818,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -8943,8 +9070,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a4373b176c6..dbc295730e9 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -704,6 +704,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d516df0ac5c..ec64394e81b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -701,10 +701,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, we need to make sure that the system is not
+		 * read only, i.e. that wal writes are permitted.  Second, we need to
+		 * make sure that there is a worker slot available.  Third, we need to
+		 * make sure that no other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a01..c0d805313dc 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up wal prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..01a40d805ff 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index bc3ceb27125..417662d28a1 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay wal prohibit state change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to delay wal prohibit state change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check for wal prohibit state change requests here too, for
+				 * the same reasons mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1a8fc167733..6996dac317a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 6baf67740c7..5a21fbbbbfa 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -726,6 +726,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ee731044b63..98cf3781135 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -234,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -658,6 +660,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2109,6 +2112,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether the WAL is prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12518,4 +12533,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ffe98322ec5
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,58 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
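
As a rough illustration of the counter encoding used by GetWALProhibitState()
above (a sketch only, not part of the patch; the real counter lives in shared
memory and is advanced by AdvanceWALProhibitStateCounter()), successive
increments map onto states like this:

	uint32	counter = 0;	/* 0 & 3 == 0 -> WALPROHIBIT_STATE_READ_WRITE */

	counter++;		/* 1 & 3 == 1 -> WALPROHIBIT_STATE_GOING_READ_ONLY */
	counter++;		/* 2 & 3 == 2 -> WALPROHIBIT_STATE_READ_ONLY */
	counter++;		/* 3 & 3 == 3 -> WALPROHIBIT_STATE_GOING_READ_WRITE */
	counter++;		/* 4 & 3 == 0 -> WALPROHIBIT_STATE_READ_WRITE again */
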
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..3bccd8c8c1f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -167,6 +167,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of work that enables wal writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -315,6 +323,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -323,6 +332,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -334,6 +344,10 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern bool XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Is the system in the WAL prohibited state, i.e., are WAL inserts disallowed? */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index acbcae46070..c2298c8c45b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11578,6 +11578,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6c6ec2e7118..e95b3197cc5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -224,7 +224,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb08319ca..b8f2e22d7e6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2816,6 +2816,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v29-0004-Documentation.patch (application/x-patch)
From 3061dfd44593ada325454f42884f23554f716bb3 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v29 4/4] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a21129021a..8d579de0778 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24883,9 +24883,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25019,6 +25019,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         is emitted and <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state according to the boolean argument and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to the WAL
+        prohibited state, in which WAL writes are disallowed, unless it is in
+        that state already.  When <literal>false</literal> is passed, the
+        system is changed to the WAL permitted state, in which WAL writes are
+        allowed, unless it is in that state already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c072110ba60..a2884a8e615 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a WAL prohibited mode, in which inserting write-ahead log is prohibited
+    until the same function is executed to change the state back to
+    read-write. As in Hot Standby, connections to the server are still allowed
+    to run read-only queries in the WAL prohibited state. If the system is in
+    the WAL prohibited state, the GUC <literal>wal_prohibited</literal>
+    reports <literal>on</literal>; otherwise it reports <literal>off</literal>.
+    When the WAL prohibited state is requested, any session running a
+    transaction that has already performed, or is expected to perform, wal
+    write operations is terminated. This is useful for HA setups where the
+    master server needs to stop accepting WAL writes immediately and kick out
+    any transaction expecting WAL writes at the end, for example when the
+    network is down on the master or replication connections have failed.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint, so
+    the next startup performs crash recovery and the system stays in the WAL
+    prohibited state until it is changed back to read-write.  If a WAL
+    prohibited server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system
+    implicitly leaves the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
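
For reference, here is a minimal usage sketch of the function and GUC documented
above (not part of the patch; it assumes a superuser session, since EXECUTE on
pg_prohibit_wal() is revoked from public):

	-- force the system into the WAL prohibited (read-only) state
	SELECT pg_prohibit_wal(true);
	SHOW wal_prohibited;           -- reports "on"

	-- allow WAL writes again
	SELECT pg_prohibit_wal(false);
	SHOW wal_prohibited;           -- reports "off"
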
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+The system is in this state when it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into the WAL prohibited state by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere that stops us from modifying data when !XLogInsertAllowed(), but
+if XLogBeginInsert() is called inside a critical section we must not depend on
+it to report an error, since that would cause a PANIC as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend that receives the WAL prohibited state transition barrier interrupt
+needs to stop WAL writing immediately.  While absorbing the barrier, a backend
+kills its running transaction if it has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning to perform, WAL
+writes.  Transactions that have not yet acquired a valid XID, and operations
+such as VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have a
+valid XID when writing WAL, are not stopped during barrier processing; they
+might instead hit the error from XLogBeginInsert() while trying to write WAL in
+the WAL prohibited state.  To prevent that error from being raised inside a
+critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we added an assertion flag that records whether the
+permission has been checked before XLogBeginInsert() is called.  If it has not
+been, XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among the permission check routines
+are:
+
+	Places where a WAL write inside a critical section can happen without a
+	valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERTs and UPDATEs are expected, which never happen without
+	a valid XID, can use AssertWALPermittedHaveXID(), so that non-assert
+	builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, which
+	may or may not have an XID, but that still need to verify on assert-enabled
+	builds that the permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
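
To make the coding rule described in the README hunk above concrete, the
intended call pattern looks roughly like the following sketch (illustrative
only, not code from the patch; the registration calls are elided):

	/* Check WAL permission here, where raising an ERROR is still acceptable. */
	CheckWALPermitted();

	START_CRIT_SECTION();

	/* Asserts that one of the permission-check routines was called. */
	XLogBeginInsert();
	/* ... XLogRegisterData() / XLogRegisterBuffer() calls and XLogInsert() ... */

	END_CRIT_SECTION();
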
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set in those states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v29-0001-Refactor-separate-WAL-writing-code-from-StartupX.patch (application/x-patch)
From 502d173d81683c982368471f6c99d951734956ee Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 11 May 2021 04:07:59 -0400
Subject: [PATCH v29 1/4] Refactor: separate WAL writing code from
 StartupXLOG().

Introduced a new function, XLogAcceptWrites(), and moved the following
code into it from StartupXLOG():

1. UpdateFullPageWrites(),
2. The following block of code that does either
   CreateEndOfRecoveryRecord() or RequestCheckpoint() or
   CreateCheckPoint(),
3. The next block of code that runs recovery_end_command,
4. XLogReportParameters(), and
5. CompleteCommitTsInitialization().

XLogAcceptWrites() is called from the place in StartupXLOG() where
XLogReportParameters() used to be.

Now the InRecovery flag is reset after the XLogAcceptWrites() call; because
of this, the assertion in SetMultiXactIdLimit() needs to be removed, since
that function is called via TrimMultiXact() before InRecovery is reset.
---
 src/backend/access/transam/multixact.c |   2 -
 src/backend/access/transam/xlog.c      | 220 ++++++++++++++-----------
 2 files changed, 124 insertions(+), 98 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index daab546f296..b66d802c86e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2290,8 +2290,6 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	if (!MultiXactState->finishedStartup)
 		return;
 
-	Assert(!InRecovery);
-
 	/* Set limits for offset vacuum. */
 	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f190f3..0b9888ce59d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -981,6 +981,9 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+							 TimeLineID EndOfLogTLI);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -7852,11 +7855,119 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory. XLOG_FPW_CHANGE record will be
+	 * written later in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * All done with end-of-recovery actions.
+	 *
+	 * Now allow backends to write WAL and update the control file status in
+	 * consequence.  SharedRecoveryState, that controls if backends can write
+	 * WAL, is updated while holding ControlFileLock to prevent other backends
+	 * to look at an inconsistent state of the control file in shared memory.
+	 * There is still a small window during which backends can write WAL and
+	 * the control file is still referring to a system not in DB_IN_PRODUCTION
+	 * state while looking at the on-disk control file.
+	 *
+	 * Also, we use info_lck to update SharedRecoveryState to ensure that
+	 * there are no race conditions concerning visibility of other recent
+	 * updates to shared memory.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take a longer
+	 * than is appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+static bool
+XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+				 TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only Startup or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7879,15 +7990,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7918,6 +8034,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7997,57 +8115,6 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -8061,46 +8128,7 @@ StartupXLOG(void)
 	 */
 	CompleteCommitTsInitialization();
 
-	/*
-	 * All done with end-of-recovery actions.
-	 *
-	 * Now allow backends to write WAL and update the control file status in
-	 * consequence.  SharedRecoveryState, that controls if backends can write
-	 * WAL, is updated while holding ControlFileLock to prevent other backends
-	 * to look at an inconsistent state of the control file in shared memory.
-	 * There is still a small window during which backends can write WAL and
-	 * the control file is still referring to a system not in DB_IN_PRODUCTION
-	 * state while looking at the on-disk control file.
-	 *
-	 * Also, we use info_lck to update SharedRecoveryState to ensure that
-	 * there are no race conditions concerning visibility of other recent
-	 * updates to shared memory.
-	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	ControlFile->state = DB_IN_PRODUCTION;
-	ControlFile->time = (pg_time_t) time(NULL);
-
-	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
-	SpinLockRelease(&XLogCtl->info_lck);
-
-	UpdateControlFile();
-	LWLockRelease(ControlFileLock);
-
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
-	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	return promoted;
 }
 
 /*
-- 
2.18.0

#133Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#132)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, May 17, 2021 at 11:48 AM Amul Sul <sulamul@gmail.com> wrote:

On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Great thanks. I will review the remaining patch soon.

I have reviewed v28-0003, and I have some comments on this.

===
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
Assert(mainrdata_len == 0);

+    /*
+     * WAL permission must have checked before entering the critical section.
+     * Otherwise, WAL prohibited error will force system panic.
+     */
+    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
!CritSectionCount);
+
/* cross-check on whether we should be here or not */
-    if (!XLogInsertAllowed())
-        elog(ERROR, "cannot make new WAL entries during recovery");
+    CheckWALPermitted();

We must not call CheckWALPermitted inside the critical section,
instead if we are here we must be sure that
WAL is permitted, so better put an assert. Even if that is ensured by
some other mean then also I don't
see any reason for calling this error generating function.

I understand that we should not raise an error inside a critical section, but
this check is not wrong. The patch has enough checking that errors due to the
WAL prohibited state cannot be hit inside a critical section; see the assert
just before CheckWALPermitted(). Before entering the critical section, we do
have an explicit WAL prohibited check, and to make sure that check has been
done for every existing critical section that writes WAL, we have the
aforesaid assert. For more detail, please have a look at the "WAL prohibited
system state" section of src/backend/access/transam/README added in the 0004
patch. This assertion also ensures that future development does not miss the
WAL prohibited state check before entering a newly added critical section that
writes WAL.

I think we do need the CheckWALPermitted() check in XLogBeginInsert(), because
XLogBeginInsert() may be called outside a critical section, e.g. from
pg_truncate_visibility_map(), and in that case we should error out. So this
check makes sense to me.
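
To spell out the rule being enforced here, a minimal sketch of the caller-side
pattern (an illustration only, not an excerpt from the patch): the permission
check runs while ereport(ERROR) is still safe, and the critical section only
starts afterwards.

    /* Check WAL permission first; an ERROR raised here cannot become a PANIC. */
    if (RelationNeedsWAL(rel))
        CheckWALPermitted();

    START_CRIT_SECTION();
    /* ... modify shared buffers, MarkBufferDirty(), then XLogBeginInsert(),
     * XLogRegisterBuffer(), XLogInsert() ... */
    END_CRIT_SECTION();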

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#134Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#133)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is a rebase on the latest master head. I also added one more
refactoring patch that deduplicates the code that sets the database state in
the control file; the same code is needed for this feature as well.

Regards.
Amul

Attachments:

v30-0002-Refactor-add-function-to-set-database-state-in-c.patch (application/octet-stream)
From fe5f249a7fdb6f0999092a611ed39f1b4076a3e7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 16 Jun 2021 09:02:24 -0400
Subject: [PATCH v30 2/5] Refactor: add function to set database state in
 control file

---
 src/backend/access/transam/xlog.c | 29 ++++++++++++++---------------
 src/include/access/xlog.h         |  2 ++
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 445f13b28a3..ac50d567be9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -38,7 +38,6 @@
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
-#include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
@@ -8135,6 +8134,17 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	return promoted;
 }
 
+/* Set ControlFile's database state */
+void
+SetControlFileDBState(DBState state)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = state;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8976,13 +8986,7 @@ CreateCheckPoint(int flags)
 	START_CRIT_SECTION();
 
 	if (shutdown)
-	{
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
-		ControlFile->time = (pg_time_t) time(NULL);
-		UpdateControlFile();
-		LWLockRelease(ControlFileLock);
-	}
+		SetControlFileDBState(DB_SHUTDOWNING);
 
 	/*
 	 * Let smgr prepare for checkpoint; this has to happen before we determine
@@ -9523,13 +9527,8 @@ CreateRestartPoint(int flags)
 
 		UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
-		{
-			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-			ControlFile->time = (pg_time_t) time(NULL);
-			UpdateControlFile();
-			LWLockRelease(ControlFileLock);
-		}
+			SetControlFileDBState(DB_SHUTDOWNED_IN_RECOVERY);
+
 		return false;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12beb..e730572b168 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -15,6 +15,7 @@
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -334,6 +335,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileDBState(DBState state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
-- 
2.18.0

v30-0005-Documentation.patch (application/octet-stream)
From 51dbff81d598da546d83c1e68c8344ec409021c1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v30 5/5] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6388385edc5..c6ad66406f4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24883,9 +24883,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25002,6 +25002,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept the state change immediately. When
+        <literal>true</literal> is passed, the system is put into the WAL
+        prohibited state, in which WAL writes are restricted, if it is not in
+        that state already. When <literal>false</literal> is passed, the
+        system is put into the WAL permitted state, in which WAL writes are
+        allowed, if it is not in that state already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 22af7dbf51b..89da3eb2e94 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    WAL prohibited mode, in which inserting write-ahead log records is
+    prohibited until the same function is executed again to change the state
+    back to read-write. As in Hot Standby, connections to the server are
+    allowed to run read-only queries in the WAL prohibited state. If the
+    system is in the WAL prohibited state, the GUC
+    <literal>wal_prohibited</literal> reports <literal>on</literal>;
+    otherwise it reports <literal>off</literal>.  When the WAL prohibited
+    state is requested, any session that is already running a transaction
+    which has performed, or is planning to perform, WAL writes is terminated.
+    This is useful for HA setups where the master server needs to stop
+    accepting WAL writes immediately and kick out any transaction expecting
+    WAL writes at the end, for example when the network is down on the master
+    or the replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint, and
+    at the next restart the server goes into crash recovery mode and stays in
+    that state until the system is changed back to read-write.  If a WAL
+    prohibited server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system
+    implicitly gets out of the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into the WAL prohibited state by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere to stop us from modifying data during recovery when
+!XLogInsertAllowed(), but if XLogBeginInsert() is called inside a critical
+section we must not depend on it to report an error; otherwise, it would cause
+a PANIC, as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed at any time by a user to stop WAL writing.
+Any backend that receives the WAL prohibited state transition barrier must
+stop writing WAL immediately.  To absorb the barrier, a backend kills its
+running transaction if it has a valid XID, since a valid XID indicates that
+the transaction has performed, or is planning to perform, WAL writes.
+Transactions that have not yet acquired a valid XID, and operations such as
+VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have a valid XID
+for their WAL writes, are not interrupted during barrier processing; those
+might hit the error from XLogBeginInsert() when trying to write WAL in the
+WAL prohibited state.  To prevent such an error from being raised by
+XLogBeginInsert() inside a critical section, WAL write permission has to be
+checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that records whether
+permission has been checked before XLogBeginInsert() is called.  If it has not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is reset automatically when the
+critical section is exited.  The rules for placing the permission check
+routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	a valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error can be reported before the critical section is entered.
+
+	Places where INSERTs and UPDATEs are expected, which never happen without
+	a valid XID, can be checked with AssertWALPermittedHaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and that
+	may or may not have an XID, but where assert-enabled builds should still
+	verify that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read-only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
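
As a condensed illustration of the placement rules described in the transam
README hunk above (a sketch using the names the patch introduces, not an
excerpt from it):

    /* VACUUM-like paths: may write WAL without a valid XID, so report the
     * error before the critical section is entered. */
    if (RelationNeedsWAL(rel))
        CheckWALPermitted();
    START_CRIT_SECTION();

    /* INSERT/UPDATE paths: always hold a valid XID, so an assert-only check
     * keeps non-assert builds free of any overhead. */
    AssertWALPermittedHaveXID();
    START_CRIT_SECTION();

    /* Paths known to be unreachable in the WAL prohibited state, with or
     * without an XID: verify only on assert-enabled builds. */
    AssertWALPermitted();
    START_CRIT_SECTION();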

v30-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/octet-stream)
From 14de4144eaaca900fc0e3e8ec950158654763c15 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v30 4/5] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR, based on the following criteria, for when the
system is WAL prohibited:

 - Add an ERROR for functions that can be reached without a valid XID, as in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - Add an Assert for functions that cannot be reached without a valid XID; the
   Assert also validates the XID.  For that, AssertWALPermitted_HaveXID() is
   added.

To enforce the rule that one of these assert or error checks precedes a
critical section that writes WAL, a new assert-only flag,
walpermit_checked_state, is added.  If the check is missing, XLogBeginInsert()
fails an assertion when called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above
checking is not necessary; we can rely on XLogBeginInsert() to perform the
check and report an error.
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 +++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 14 ++++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 46 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 +++++++++++++
 39 files changed, 500 insertions(+), 67 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index d31e5f31fd4..2938b937f75 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/* Check target relation. */
 	sanity_check_relation(rel);
@@ -217,6 +220,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -297,12 +303,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..8c672770e79 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1ff2e0c18ee..a6ac301ba0d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 49a98677876..f161a9a49eb 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2433998f39b..2945ea4b6ba 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..0cb9adf8b5d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write WAL if it is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88db2e2cfce..95e831d75ac 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1318,6 +1319,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1333,8 +1339,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1940,8 +1945,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1966,7 +1976,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2399,6 +2409,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2409,6 +2420,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2439,7 +2453,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..05f7de15c76 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibit state, so
+	 * only assert, instead of checking permission, in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -476,6 +488,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -489,8 +502,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -518,7 +536,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 271994b08df..99466b5a5a9 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66d802c86e..b40af2937d9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2946,7 +2949,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813c564..ddbdda8b79d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a22bf375f85..fce9ad78afa 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index eb0e51301d9..e0cdc0dfe29 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission must be checked
+ * before starting a critical section that will write WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6c609e0e4b4..b0363e82c77 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can reach here only with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91cbb54b206..95a6f24dda1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1036,7 +1036,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2889,9 +2889,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise, dirty buffers
+	 * could not be evicted, since that requires flushing WAL up to their LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9321,6 +9323,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9478,6 +9483,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10125,7 +10132,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10139,10 +10146,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10164,8 +10171,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assert that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 32b4cc84e79..6b669f32c02 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -126,9 +127,15 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -210,6 +217,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walprohibit_checked_state */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0415df9ccb7..ab108e621d4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 84f6f694977..fb778b56e74 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -935,6 +935,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* Checkpoints are allowed during recovery but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a22c45..57d05e094c6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3904,13 +3904,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL insertion is disallowed in general
+			 * (recovery or the WAL prohibit state), don't dirty the page.
+			 * We can set the hint, but must not dirty the page as a result,
+			 * lest we trigger WAL generation.  Unless the page is dirtied
+			 * again later, the hint will be lost at eviction or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda2380..41ba8c59a88 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624cf0da..d91e247f3e4 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -841,6 +842,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ff77a68552c..e007307ac5d 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -56,4 +56,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off when pg_prohibit_wal() is
+ * executed, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 4dc343cbc59..c708ecdc743 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset walprohibit_checked_state when we are no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
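
A minimal sketch of the coding rule the call sites above follow (not part of
the patch; example_log_page_change() and its arguments are hypothetical, while
CheckWALPermitted() and the other helpers are the ones added by this series):
check WAL permission once, outside the critical section, and reuse the cached
result when deciding whether to emit the WAL record.

	#include "postgres.h"

	#include "access/walprohibit.h"		/* CheckWALPermitted() */
	#include "miscadmin.h"				/* START_/END_CRIT_SECTION() */
	#include "storage/bufmgr.h"			/* MarkBufferDirty() */
	#include "utils/rel.h"				/* RelationNeedsWAL() */

	static void
	example_log_page_change(Relation rel, Buffer buf)
	{
		bool		needwal = RelationNeedsWAL(rel);

		/* Error out before the critical section if WAL is prohibited */
		if (needwal)
			CheckWALPermitted();

		START_CRIT_SECTION();

		/* ... scribble on the page ... */
		MarkBufferDirty(buf);

		/* Emit the WAL record, reusing the result cached above */
		if (needwal)
		{
			/* XLogBeginInsert(), XLogRegisterBuffer(), XLogInsert() go here */
		}

		END_CRIT_SECTION();
	}

Code paths that can only be reached with an XID assigned use
AssertWALPermittedHaveXID() instead, since XID-bearing sessions are killed
before the system enters the WAL prohibited state.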

v30-0003-Implement-wal-prohibit-state-using-global-barrie.patchapplication/octet-stream; name=v30-0003-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From d7518bf1032afda761fb679fbd132542f7e02533 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v30 3/5] Implement wal prohibit state using global barrier.

Implementation:

 1. A user requests the WAL-prohibited state by calling the
    pg_prohibit_wal(true) SQL function.  The backend marks the state
    transition as in progress in shared memory and signals the
    checkpointer process.  The checkpointer, noticing the pending state
    transition, emits the barrier request and then acknowledges back to
    the backend that requested the state change once the transition has
    been completed.  The final state is also updated in the control file
    to make it persistent across restarts (see the sketch after this list).

 2. When a backend receives the WAL-prohibited barrier while it is in a
    transaction that has already been assigned an XID, the backend is
    killed by throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special needs to be done right away; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which reflects the WAL prohibited state.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    while the server is in the WAL-prohibited state until someone wakes them
    up, e.g. a backend that later requests putting the system back into a
    state where WAL is no longer prohibited.

 6. At shutdown in WAL-prohibited mode, we skip the shutdown checkpoint
    and xlog rotation.  Starting up again performs crash recovery, but the
    end-of-recovery checkpoint and the WAL writes necessary to start the
    server normally are skipped; they are performed later, once the system
    is changed back so that WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is not allowed on a standby server.

 8. The presence of a recovery.signal and/or standby.signal file will
    implicitly and permanently pull the server out of the WAL prohibited state.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on" when
    the system is WAL prohibited.
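
As a rough sketch (not part of the patch; the state names below appear in
walprohibit.c, while the numeric encoding is an assumption inferred from the
counter being advanced by one per transition and masked with 3), the low two
bits of the shared wal_prohibit_counter encode the current state:

	typedef enum WALProhibitState
	{
		WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL writes permitted */
		WALPROHIBIT_STATE_GOING_READ_ONLY = 1,	/* transition in progress */
		WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL writes prohibited */
		WALPROHIBIT_STATE_GOING_READ_WRITE = 3	/* transition in progress */
	} WALProhibitState;

pg_prohibit_wal() advances the counter once to enter a GOING_* state and
signals the checkpointer; after every backend has absorbed the barrier, the
checkpointer advances the counter once more to reach the final state.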
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 482 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 167 ++++++--
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  59 +++
 src/include/access/xlog.h                |  14 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 843 insertions(+), 79 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..eb0e51301d9
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,482 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the low two bits of the counter
+	 * encode the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take an appropriate action when system wide WAL prohibit
+ *	state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons, which still need more thought:
+		 *
+		 * 1. The wire protocol presents challenges that prevent us from
+		 * simply killing an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to permit WAL writes is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to prohibit WAL writes is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Any state other than read-write is considered read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete WAL prohibit state transition.
+ *
+ *	Based on the final WAL prohibit state being transitioned to, the in-memory
+ *	state update is done either before or after emitting the global barrier.
+ *
+ *	The idea behind this is that when we say the system is WAL prohibited,
+ *	WAL writes must be prohibited in all backends, but when the system is no
+ *	longer WAL prohibited, it is not necessary to take every backend out of
+ *	the WAL prohibited state at once.  There is no harm in letting those
+ *	backends run read-only a little longer, until we emit the barrier, since
+ *	they may have connected while the system was WAL prohibited and may be
+ *	performing read-only operations.  Backends that connect from now on can
+ *	start read-write operations immediately.
+ *
+ *	Therefore, when moving the system to the state where WAL is no longer
+ *	prohibited, we update the shared state immediately and emit the barrier
+ *	afterwards.  But when moving the system to the WAL prohibited state, we
+ *	emit the global barrier first, to ensure that no backend writes WAL
+ *	before we set the system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it must be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* Going out of WAL prohibited state then update state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operations needed to allow WAL writes should be done by now */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * The counter should now indicate the final state, where WAL is no
+		 * longer prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by every backend, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have been changed since we
+		 * last read it.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* The counter should now indicate the final WAL prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Only the checkpointer process performs the state transitions, and it
+	 * has to process all pending WAL prohibit state change requests as soon
+	 * as possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes
+	 * run in non-checkpointer processes, do nothing if we are not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process normally performs to
+				 * bring the server up were skipped; if so, do them right
+				 * away.  While doing that, hold off state transitions to
+				 * avoid a recursive attempt to process a WAL prohibit
+				 * state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				ResetLocalXLogInsertAllowed();
+				HoldWALProhibitStateTransition = true;
+				XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
+
+				/*
+				 * We need to update DBState explicitly, as the startup process
+				 * does, because the end-of-recovery checkpoint sets the DB
+				 * state to shutdown.
+				 */
+				SetControlFileDBState(DB_IN_PRODUCTION);
+				HoldWALProhibitStateTransition = false;
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request that the system be put back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e8..6c609e0e4b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because the pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ac50d567be9..91cbb54b206 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -247,9 +248,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -731,6 +733,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes needed to start the server have been done.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -980,9 +988,6 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
-static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-							 TimeLineID EndOfLogTLI);
-
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -5226,6 +5231,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6249,6 +6255,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch the latest state of whether WAL writes are allowed.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6614,13 +6629,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -7897,7 +7929,31 @@ StartupXLOG(void)
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
-	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which determines whether further WAL inserts will be
+	 * allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		/*
+		 * We do start in recovery, since shutting down in the WAL prohibit
+		 * state skips the shutdown checkpoint, which forces recovery on restart.
+		 */
+		Assert(InRecovery);
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+		promoted = XLogAcceptWrites(InRecovery, xlogreader, EndOfLog, EndOfLogTLI);
 
 	/*
 	 * Okay, we're officially UP.
@@ -7958,14 +8014,29 @@ StartupXLOG(void)
  * Performs necessary WAL writes that must be done before any other backends are
  * allowed to write a WAL records when the server starts.
  */
-static bool
-XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
-				 TimeLineID EndOfLogTLI)
+bool
+XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+				 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI)
 {
 	bool		promoted = false;
 
-	/* Only Startup or standalone backend allowed to be here. */
-	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsPostmasterEnvironment);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in the WAL prohibited state, only the checkpointer
+	 * process should reach here, to complete the operation that was skipped
+	 * earlier while booting in that state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
@@ -7975,7 +8046,7 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	if (needChkpt)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -8131,6 +8202,12 @@ XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
@@ -8145,6 +8222,17 @@ SetControlFileDBState(DBState state)
 	LWLockRelease(ControlFileLock);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8359,9 +8447,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8380,9 +8468,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8404,6 +8503,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8693,9 +8798,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint if still in recovery; otherwise, perform the
+	 * shutdown checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8708,6 +8817,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -8957,8 +9069,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a416e94d371..0934478188e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -699,6 +699,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d516df0ac5c..ec64394e81b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -701,10 +701,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb6..5157237731c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -278,7 +278,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 75a95f3de7a..84f6f694977 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -351,6 +352,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -699,6 +701,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1346,3 +1351,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d90238..01a40d805ff 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -793,15 +793,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index bc3ceb27125..417662d28a1 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to stall WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to stall WAL prohibit change requests for a long time
+		 * when there are many fsync requests to be processed.  They need to be
+		 * checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here as well,
+				 * for the same reason as mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1a8fc167733..6996dac317a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 6baf67740c7..5a21fbbbbfa 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -726,6 +726,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68b62d523dc..a9cd7adec2c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -234,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -658,6 +660,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2109,6 +2112,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether the WAL is prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12519,4 +12534,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ff77a68552c
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,59 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e730572b168..936d143eaed 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -168,6 +168,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* State of the work that enables WAL writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -316,6 +324,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -324,6 +333,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -335,7 +345,11 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingStartupOperations(void);
+extern bool XLogAcceptWrites(bool needChkpt, XLogReaderState *xlogreader,
+							 XLogRecPtr EndOfLog, TimeLineID EndOfLogTLI);
 extern void SetControlFileDBState(DBState state);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Determines whether WAL inserts are prohibited or allowed. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fde251fa4f3..098a22e81f2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11573,6 +11573,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6c6ec2e7118..e95b3197cc5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -224,7 +224,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb08319ca..b8f2e22d7e6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2816,6 +2816,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v30-0001-Refactor-separate-WAL-writing-code-from-StartupX.patch (application/octet-stream)
From cc21b4a50d45b4d1d453b30864c8053685fc32c7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 11 May 2021 04:07:59 -0400
Subject: [PATCH v30 1/5] Refactor: separate WAL writing code from
 StartupXLOG().

Introduce a new function, XLogAcceptWrites(), and move the following
code from StartupXLOG() into it:

1. UpdateFullPageWrites(),
2. The following block of code that does either
   CreateEndOfRecoveryRecord() or RequestCheckpoint() or
   CreateCheckPoint(),
3. The next block of code that runs recovery_end_command,
4. XLogReportParameters(), and
5. CompleteCommitTsInitialization().

XLogAcceptWrites() is called from the place in StartupXLOG() where
XLogReportParameters() previously was.

Now the "InRecovery" flag will be reset after the XLogAcceptWrites()
call, and because of this the assertion in SetMultiXactIdLimit() needs
to be removed, since that function gets called via TrimMultiXact()
before InRecovery is reset.
---
 src/backend/access/transam/multixact.c |   2 -
 src/backend/access/transam/xlog.c      | 224 ++++++++++++++-----------
 2 files changed, 128 insertions(+), 98 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index daab546f296..b66d802c86e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2290,8 +2290,6 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid,
 	if (!MultiXactState->finishedStartup)
 		return;
 
-	Assert(!InRecovery);
-
 	/* Set limits for offset vacuum. */
 	needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 17eeff07200..445f13b28a3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -981,6 +981,9 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+							 TimeLineID EndOfLogTLI);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -7852,11 +7855,123 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory.  The XLOG_FPW_CHANGE record
+	 * will be written later, while accepting WAL writes in XLogAcceptWrites().
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/* Start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* Also, initialize latestCompletedXid to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/*
+	 * Shutdown the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
+	promoted = XLogAcceptWrites(xlogreader, EndOfLog, EndOfLogTLI);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * All done with end-of-recovery actions.
+	 *
+	 * Now allow backends to write WAL and update the control file status in
+	 * consequence.  SharedRecoveryState, that controls if backends can write
+	 * WAL, is updated while holding ControlFileLock to prevent other backends
+	 * to look at an inconsistent state of the control file in shared memory.
+	 * There is still a small window during which backends can write WAL and
+	 * the control file is still referring to a system not in DB_IN_PRODUCTION
+	 * state while looking at the on-disk control file.
+	 *
+	 * Also, we use info_lck to update SharedRecoveryState to ensure that
+	 * there are no race conditions concerning visibility of other recent
+	 * updates to shared memory.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This
+	 * isn't required for consistency, but the last restartpoint might be far
+	 * back, and in case of a crash, recovering from it might take a longer
+	 * than is appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+/*
+ * Performs necessary WAL writes that must be done before any other backends are
+ * allowed to write WAL records when the server starts.
+ */
+static bool
+XLogAcceptWrites(XLogReaderState *xlogreader, XLogRecPtr EndOfLog,
+				 TimeLineID EndOfLogTLI)
+{
+	bool		promoted = false;
+
+	/* Only Startup or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	/*
+	 * Write an XLOG_FPW_CHANGE record before resource manager writes cleanup
+	 * WAL records or checkpoint record is written.
+	 */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7879,15 +7994,20 @@ StartupXLOG(void)
 		 */
 		if (bgwriterLaunched)
 		{
+			/* bgwriterLaunched is only true in startup process */
+			Assert(AmStartupProcess());
+
 			if (LocalPromoteIsTriggered)
 			{
-				checkPointLoc = ControlFile->checkPoint;
+				XLogRecord *record;
 
 				/*
 				 * Confirm the last checkpoint is available for us to recover
 				 * from if we fail.
 				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+				record = ReadCheckpointRecord(xlogreader,
+											  ControlFile->checkPoint,
+											  1, false);
 				if (record != NULL)
 				{
 					promoted = true;
@@ -7918,6 +8038,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		Assert(AmStartupProcess());
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7997,57 +8119,6 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/*
-	 * Shutdown the recovery environment. This must occur after
-	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
-	 */
-	if (standbyState != STANDBY_DISABLED)
-		ShutdownRecoveryTransactionEnvironment();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
@@ -8061,46 +8132,7 @@ StartupXLOG(void)
 	 */
 	CompleteCommitTsInitialization();
 
-	/*
-	 * All done with end-of-recovery actions.
-	 *
-	 * Now allow backends to write WAL and update the control file status in
-	 * consequence.  SharedRecoveryState, that controls if backends can write
-	 * WAL, is updated while holding ControlFileLock to prevent other backends
-	 * to look at an inconsistent state of the control file in shared memory.
-	 * There is still a small window during which backends can write WAL and
-	 * the control file is still referring to a system not in DB_IN_PRODUCTION
-	 * state while looking at the on-disk control file.
-	 *
-	 * Also, we use info_lck to update SharedRecoveryState to ensure that
-	 * there are no race conditions concerning visibility of other recent
-	 * updates to shared memory.
-	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	ControlFile->state = DB_IN_PRODUCTION;
-	ControlFile->time = (pg_time_t) time(NULL);
-
-	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
-	SpinLockRelease(&XLogCtl->info_lck);
-
-	UpdateControlFile();
-	LWLockRelease(ControlFileLock);
-
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
-	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	return promoted;
 }
 
 /*
-- 
2.18.0

#135Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#134)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 17, 2021 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is a rebase on the latest master head. Also, I added one more
refactoring patch that deduplicates the code setting the database state
in the control file. The same code that sets the database state is also
needed for this feature.

I started studying 0001 today and found that it rearranged the order
of operations in StartupXLOG() more than I was expecting. It does, as
per previous discussions, move a bunch of things to the place where we
now call XLogReportParameters(). But, unsatisfyingly, InRecovery = false and
XLogReaderFree() then have to move down even further. Since the goal
here is to get to a situation where we sometimes call XLogAcceptWrites()
after InRecovery = false, it didn't seem nice for this refactoring
patch to still end up with a situation where this stuff happens while
InRecovery = true. In fact, with the patch, the amount of code that
runs with InRecovery = true actually *increases*, which is not what I
think should be happening here. That's why the patch ends up having to
adjust SetMultiXactIdLimit to not Assert(!InRecovery).

And then I started to wonder how this was ever going to work as part
of the larger patch set, because as you have it here,
XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
checkpointer is calling that at a later time after the user issues
pg_prohibit_wal(false), it's going to have none of those things. So I
had a quick look at that part of the code and found this in
checkpointer.c:

XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);

For those following along from home, the additional "true" is a bool
needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
this is very satisfying. The whole purpose of passing the xlogreader
is so we can figure out whether we need a checkpoint (never mind the
question of whether the existing algorithm for determining that is
really sensible) but now we need a second argument that basically
serves the same purpose since one of the two callers to this function
won't have an xlogreader. And then we're passing the EndOfLog and
EndOfLogTLI as dummy values which seems like it's probably just
totally wrong, but if for some reason it works correctly there sure
don't seem to be any comments explaining why.

So I started doing a bit of hacking myself and ended up with the
attached, which I think is not completely the right thing yet but I
think it's better than your version. I split this into three parts.
0001 splits up the logic that currently decides whether to write an
end-of-recovery record or a checkpoint record and, if the latter, how
the checkpoint ought to be performed into two functions.
DetermineRecoveryXlogAction() figures out what we want to do, and
PerformRecoveryXlogAction() does it. It also moves the code to run
recovery_end_command and related stuff into a new function
CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
CleanupAfterArchiveRecovery() to just before we
XLogReportParameters(). Because of the refactoring done by 0001, this
is only a small amount of code movement. Because of the separation
between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
the latter doesn't need the xlogreader. So we can do
DetermineRecoveryXlogAction() at the same time as now, while the
xlogreader is available, and then we don't need it later when we
PerformRecoveryXlogAction(), because we already know what we need to
know. I think this is all fine as far as it goes.

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
seem too problematic, because that value has been stored in four (!)
places inside XLogCtl by this code:

LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

XLogCtl->LogwrtResult = LogwrtResult;

XLogCtl->LogwrtRqst.Write = EndOfLog;
XLogCtl->LogwrtRqst.Flush = EndOfLog;

Presumably we could relatively easily change things around so that we
finish one of those values ... probably one of the "write" values ..
back out of XLogCtl instead of passing it as a parameter. That would
work just as well from the checkpointer as from the startup process,
and there seems to be no way for the value to change until after
XLogAcceptWrites() has been called, so it seems fine. But that doesn't
help for the other arguments. What I'm thinking is that we should just
arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
then XLogAcceptWrites() can fish those values out of there as well,
which should be enough to make it work and do the same thing
regardless of which process is calling it. But I have run out of time
for today so have not explored coding that up.
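
Just to illustrate the shape of it (nothing like this is in the posted
patches; the XLogCtl fields EndOfLogTLI and recoveryXlogAction and the
zero-argument signature are invented here purely for illustration):

/* In StartupXLOG(), while the xlogreader and local variables still exist: */
	SpinLockAcquire(&XLogCtl->info_lck);
	XLogCtl->EndOfLogTLI = EndOfLogTLI;
	XLogCtl->recoveryXlogAction = DetermineRecoveryXlogAction(xlogreader);
	SpinLockRelease(&XLogCtl->info_lck);

/* Later, callable from either the startup process or the checkpointer: */
void
XLogAcceptWrites(void)
{
	XLogRecPtr	EndOfLog;
	TimeLineID	EndOfLogTLI;
	RecoveryXlogAction xlogaction;

	/* Everything we need has already been stashed in shared memory. */
	SpinLockAcquire(&XLogCtl->info_lck);
	EndOfLog = XLogCtl->LogwrtRqst.Write;
	EndOfLogTLI = XLogCtl->EndOfLogTLI;
	xlogaction = XLogCtl->recoveryXlogAction;
	SpinLockRelease(&XLogCtl->info_lck);

	/* Now do the WAL writes that were deferred until this point. */
	LocalSetXLogInsertAllowed();
	UpdateFullPageWrites();
	LocalXLogInsertAllowed = -1;

	PerformRecoveryXLogAction(xlogaction);

	if (ArchiveRecoveryRequested)
		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
}

That would make the function callable with no arguments at all, which is
what the checkpointer case needs. Whether the spinlock is really required
for reading those fields at that point is a detail I haven't thought
through.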

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0001-Refactor-some-end-of-recovery-code-out-of-StartupXLO.patch (application/octet-stream)
From 407958a9230152420ce72c92dbe08ee38bbafaf2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH 1/3] Refactor some end-of-recovery code out of StartupXLOG().

Split the code that decides whether to write a checkpoint or an
end-of-recovery record into DetermineRecoveryXlogAction(), which
decides what to do, and PerformRecoveryXlogAction(). Right now
these are always called one after the other, but further refactoring
is planned which will separate them.

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.
---
 src/backend/access/transam/xlog.c | 351 ++++++++++++++++++------------
 1 file changed, 216 insertions(+), 135 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402272..203a9babc9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -525,6 +525,31 @@ typedef enum ExclusiveBackupState
 	EXCLUSIVE_BACKUP_STOPPING
 } ExclusiveBackupState;
 
+/*
+ * What should we do when we reach the end of REDO to ensure that we'll
+ * be able to recover properly if we crash again?
+ *
+ * RECOVERY_XLOG_NOTHING means we didn't actually REDO anything and therefore
+ * no action is required.
+ *
+ * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
+ * end-of-recovery record but don't need to checkpoint.
+ *
+ * RECOVERY_XLOG_WRITE_CHECKPOINT means we need to write a checkpoint.
+ * This is only valid when the checkpointer is not running.
+ *
+ * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need a request that the
+ * checkpointer perform a checkpoint. This is only valid when the
+ * checkpointer is running.
+ */
+typedef enum
+{
+	RECOVERY_XLOG_NOTHING,
+	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_WRITE_CHECKPOINT,
+	RECOVERY_XLOG_REQUEST_CHECKPOINT
+} RecoveryXlogAction;
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -902,6 +927,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
@@ -946,6 +973,8 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static RecoveryXlogAction DetermineRecoveryXlogAction(XLogReaderState *xlogreader);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5717,6 +5746,88 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be
+	 * pre-allocated files containing garbage. In any case, they are not part
+	 * of the new timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because
+	 * it was hit by a meteor), it will never make it to the archive. That's
+	 * OK from our point of view, because the new segment that we created with
+	 * the new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically
+	 * present in the new file with new TLI, but recovery won't look there
+	 * when it's recovering to the older timeline. On the other hand, if we
+	 * archive the partial segment, and the original server on that timeline
+	 * is still running and archives the completed version of the same segment
+	 * later, it will fail. (We used to do that in 9.4 and below, and it
+	 * caused such problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix,
+	 * and archive it. Archive recovery will never try to read .partial
+	 * segments, so they will normally go unused. But in the odd PITR case,
+	 * the administrator can copy them manually to the pg_wal directory
+	 * (removing the suffix). They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let
+	 * it be archived normally. (In particular, if it was restored from the
+	 * archive to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -6490,7 +6601,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	RecoveryXlogAction xlogaction;
 	struct stat st;
 
 	/*
@@ -7897,141 +8008,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (bgwriterLaunched)
-		{
-			if (LocalPromoteIsTriggered)
-			{
-				checkPointLoc = ControlFile->checkPoint;
-
-				/*
-				 * Confirm the last checkpoint is available for us to recover
-				 * from if we fail.
-				 */
-				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
-				if (record != NULL)
-				{
-					promoted = true;
-
-					/*
-					 * Insert a special WAL record to mark the end of
-					 * recovery, since we aren't doing a checkpoint. That
-					 * means that the checkpointer process may likely be in
-					 * the middle of a time-smoothed restartpoint and could
-					 * continue to be for minutes after this. That sounds
-					 * strange, but the effect is roughly the same and it
-					 * would be stranger to try to come out of the
-					 * restartpoint and then checkpoint. We request a
-					 * checkpoint later anyway, just for safety.
-					 */
-					CreateEndOfRecoveryRecord();
-				}
-			}
-
-			if (!promoted)
-				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-								  CHECKPOINT_IMMEDIATE |
-								  CHECKPOINT_WAIT);
-		}
-		else
-			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
-	}
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	xlogaction = DetermineRecoveryXlogAction(xlogreader);
+	PerformRecoveryXLogAction(xlogaction);
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8135,7 +8118,7 @@ StartupXLOG(void)
 	 * and in case of a crash, recovering from it might take a longer than is
 	 * appropriate now that we're not in standby mode anymore.
 	 */
-	if (promoted)
+	if (xlogaction == RECOVERY_XLOG_WRITE_END_OF_RECOVERY)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8235,6 +8218,104 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Determine what needs to be done upon completing REDO.
+ */
+static RecoveryXlogAction
+DetermineRecoveryXlogAction(XLogReaderState *xlogreader)
+{
+	/* No REDO, hence no action required. */
+	if (!InRecovery)
+		return RECOVERY_XLOG_NOTHING;
+
+	/*
+	 * bgwriterLaunched actually indicates both whether the bgwriter process
+	 * has been launched and also whether the checkpointer process has been
+	 * launched. So, if it's false, we can't request a checkpoint and must do
+	 * it locally.
+	 *
+	 * NB: We don't launch the bgwriter and checkpointer during crash
+	 * recovery, which will therefore always write a checkpoint.
+	 */
+	if (!bgwriterLaunched)
+		return RECOVERY_XLOG_WRITE_CHECKPOINT;
+
+	/*
+	 * In promotion, only create a lightweight end-of-recovery record instead
+	 * of a full checkpoint. A checkpoint is requested later, after we're
+	 * fully out of recovery mode and already accepting WAL writes.
+	 */
+	if (LocalPromoteIsTriggered)
+	{
+		XLogRecPtr	checkPointLoc = ControlFile->checkPoint;
+		XLogRecord *record;
+
+		/*
+		 * Confirm the last checkpoint is available for us to recover from if
+		 * we fail.
+		 */
+		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
+		if (record != NULL)
+		{
+			/*
+			 * Insert a special WAL record to mark the end of recovery, since
+			 * we aren't doing a checkpoint. That means that the checkpointer
+			 * process may likely be in the middle of a time-smoothed
+			 * restartpoint and could continue to be for minutes after this.
+			 * That sounds strange, but the effect is roughly the same and it
+			 * would be stranger to try to come out of the restartpoint and
+			 * then checkpoint. We request a checkpoint later anyway, just for
+			 * safety.
+			 */
+			return RECOVERY_XLOG_WRITE_END_OF_RECOVERY;
+		}
+	}
+
+	/*
+	 * We decided against writing only an end-of-recovery record, and we know
+	 * that the postmaster was told to launch the checkpointer, so just
+	 * request a checkpoint.
+	 */
+	return RECOVERY_XLOG_REQUEST_CHECKPOINT;
+}
+
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static void
+PerformRecoveryXLogAction(RecoveryXlogAction action)
+{
+	switch (action)
+	{
+		case RECOVERY_XLOG_NOTHING:
+			/* No REDO performed, hence nothing to do. */
+			break;
+
+		case RECOVERY_XLOG_WRITE_END_OF_RECOVERY:
+			/* Lightweight end-of-recovery record in lieu of checkpoint. */
+			CreateEndOfRecoveryRecord();
+			break;
+
+		case RECOVERY_XLOG_WRITE_CHECKPOINT:
+			/* Full checkpoint, when checkpointer is not running. */
+			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+			break;
+
+		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
+			/* Full checkpoint, when checkpointer is running. */
+			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+							  CHECKPOINT_IMMEDIATE |
+							  CHECKPOINT_WAIT);
+	}
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.24.3 (Apple Git-128)

0002-Postpone-some-end-of-recovery-operations-relating-to.patchapplication/octet-stream; name=0002-Postpone-some-end-of-recovery-operations-relating-to.patchDownload
From e1f1e4c6aa5ddb2f656978d84577c45cc0a83c00 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH 2/3] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, we issued XLOG_FPW_CHANGE and either
XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY while still
technically in recovery, and also performed post-archive-recovery
cleanup steps at that point. Postpone that stuff until after we clear
InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.
---
 src/backend/access/transam/xlog.c | 34 ++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 203a9babc9..c652a0635d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7999,22 +7999,11 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Figure out what xlog activity is needed to mark end of recovery. We
+	 * must make this determination before setting InRecovery = false, or
+	 * we'll get the wrong answer.
 	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	xlogaction = DetermineRecoveryXlogAction(xlogreader);
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8067,6 +8056,23 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.24.3 (Apple Git-128)

0003-Create-XLogAcceptWrites-function-with-code-from-Star.patchapplication/octet-stream; name=0003-Create-XLogAcceptWrites-function-with-code-from-Star.patchDownload
From 0f14c0e6534e4d24432dffc6388ef04543ff4fe6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 15:37:53 -0400
Subject: [PATCH 3/3] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.
---
 src/backend/access/transam/xlog.c | 75 +++++++++++++++++++------------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c652a0635d..9707bba1d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -973,6 +973,9 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static void XLogAcceptWrites(RecoveryXlogAction xlogaction,
+							 TimeLineID EndOfLogTLI,
+							 XLogRecPtr EndOfLog);
 static RecoveryXlogAction DetermineRecoveryXlogAction(XLogReaderState *xlogreader);
 static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
@@ -8056,35 +8059,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8128,6 +8104,47 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static void
+XLogAcceptWrites(RecoveryXlogAction xlogaction,
+				 TimeLineID EndOfLogTLI,
+				 XLogRecPtr EndOfLog)
+{
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.24.3 (Apple Git-128)

#136Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#135)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog?

On further study, I found another problem: the way my patch set leaves
things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
will not be correctly initialized in any process other than the
startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
would just be skipped. Your 0001 seems to have the same problem. You
added Assert(AmStartupProcess()) to the inside of the if
(ArchiveRecoveryRequested) block, but that doesn't fix anything.
Outside the startup process, ArchiveRecoveryRequested will always be
false, but the point is that the associated stuff should be done if
ArchiveRecoveryRequested would have been true in the startup process.
Both of our patch sets leave things in a state where that would never
happen, which is not good. Unless I'm missing something, it seems like
maybe you didn't test your patches to verify that, when the
XLogAcceptWrites() call comes from the checkpointer, all the same
things happen that would have happened had it been called from the
startup process. That would be a really good thing to have tested
before posting your patches.
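
To illustrate the shape of a possible fix, here is a minimal sketch -- not
from either patch set -- of how the startup process could publish that
decision in shared memory. The XLogCtl field name
SharedArchiveRecoveryRequested is hypothetical; the other identifiers exist
today:

/* In StartupXLOG(), while still running in the startup process: */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
SpinLockRelease(&XLogCtl->info_lck);

/* Later, in whichever process ends up calling XLogAcceptWrites(): */
bool		archiveRecoveryRequested;

/* info_lck is probably overkill here, since the flag never changes
 * after startup, but it keeps the sketch simple. */
SpinLockAcquire(&XLogCtl->info_lck);
archiveRecoveryRequested = XLogCtl->SharedArchiveRecoveryRequested;
SpinLockRelease(&XLogCtl->info_lck);

if (archiveRecoveryRequested)
	CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);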

As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
several TLIs stored in XLogCtl. None of them seem to be precisely the
same thing as EndOfLogTLI, but I am hoping that replayEndTLI is close
enough. I found out pretty quickly through testing that replayEndTLI
isn't always valid -- it ends up 0 if we don't enter recovery. That's
not really a problem, though, because we only need it to be valid if
ArchiveRecoveryRequested. The code that initializes and updates it
seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
= true will force InRecovery = true. So it looks to me like
replayEndTLI will always be initialized in the cases where we need a
value. It's not yet entirely clear to me if it has to have the same
value as EndOfLogTLI. I find this code comment quite mysterious:

/*
* EndOfLogTLI is the TLI in the filename of the XLOG segment containing
* the end-of-log. It could be different from the timeline that EndOfLog
* nominally belongs to, if there was a timeline switch in that segment,
* and we were reading the old WAL from a segment belonging to a higher
* timeline.
*/
EndOfLogTLI = xlogreader->seg.ws_tli;

The thing is, if we were reading old WAL from a segment belonging to a
higher timeline, wouldn't we have switched to that new timeline?
Suppose we want WAL segment 246 from TLI 1, but we don't have that
segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
the TLI 2 version, we'd need to have TLI 2 in the history of the
recovery_target_timeline. And if that is the case, then we would have
to replay through the record where the timeline changes. And if we do
that, then the discrepancy postulated by the comment cannot still
exist by the time we reach this code, because this code is only
reached after we finish WAL redo. So I'm baffled as to how this can
happen, but considering how many cases there are in this code, I sure
can't promise that it doesn't. The fact that we have few tests for any
of this doesn't help either.
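
For reference, assuming replayEndTLI and lastReplayedEndRecPtr really are
close enough to EndOfLogTLI and EndOfLog, a minimal sketch of how the
checkpointer could pick them up from shared memory might look like this;
GetEndOfRecoveryInfo is a hypothetical helper, not code from the patch set:

static void
GetEndOfRecoveryInfo(XLogRecPtr *endOfLog, TimeLineID *endOfLogTLI)
{
	/*
	 * Both fields live in XLogCtl and are protected by info_lck.  Whether
	 * replayEndTLI (or lastReplayedTLI) is really equivalent to the
	 * EndOfLogTLI computed in StartupXLOG() is exactly the open question
	 * above.
	 */
	SpinLockAcquire(&XLogCtl->info_lck);
	*endOfLog = XLogCtl->lastReplayedEndRecPtr;
	*endOfLogTLI = XLogCtl->replayEndTLI;
	SpinLockRelease(&XLogCtl->info_lck);
}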

--
Robert Haas
EDB: http://www.enterprisedb.com

#137Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#136)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jul 28, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog?

On further study, I found another problem: the way my patch set leaves
things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
will not be correctly initialized in any process other than the
startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
would just be skipped. Your 0001 seems to have the same problem. You
added Assert(AmStartupProcess()) to the inside of the if
(ArchiveRecoveryRequested) block, but that doesn't fix anything.
Outside the startup process, ArchiveRecoveryRequested will always be
false, but the point is that the associated stuff should be done if
ArchiveRecoveryRequested would have been true in the startup process.
Both of our patch sets leave things in a state where that would never
happen, which is not good. Unless I'm missing something, it seems like
maybe you didn't test your patches to verify that, when the
XLogAcceptWrites() call comes from the checkpointer, all the same
things happen that would have happened had it been called from the
startup process. That would be a really good thing to have tested
before posting your patches.

My bad, I am extremely sorry about that. I usually do test my patches,
but I failed to test this change because I was busy manually testing the
whole ASRO feature and hurried to post the newest version.

I will try to be more careful next time.

As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
several TLIs stored in XLogCtl. None of them seem to be precisely the
same thing as EndLogTLI, but I am hoping that replayEndTLI is close
enough. I found out pretty quickly through testing that replayEndTLI
isn't always valid -- it ends up 0 if we don't enter recovery. That's
not really a problem, though, because we only need it to be valid if
ArchiveRecoveryRequested. The code that initializes and updates it
seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
= true will force InRecovery = true. So it looks to me like
replayEndTLI will always be initialized in the cases where we need a
value. It's not yet entirely clear to me if it has to have the same
value as EndOfLogTLI. I find this code comment quite mysterious:

/*
* EndOfLogTLI is the TLI in the filename of the XLOG segment containing
* the end-of-log. It could be different from the timeline that EndOfLog
* nominally belongs to, if there was a timeline switch in that segment,
* and we were reading the old WAL from a segment belonging to a higher
* timeline.
*/
EndOfLogTLI = xlogreader->seg.ws_tli;

The thing is, if we were reading old WAL from a segment belonging to a
higher timeline, wouldn't we have switched to that new timeline?

As far as I understand from browsing the code, yes, we do switch to the
new timeline. Also, lastReplayedTLI and lastReplayedEndRecPtr appear to
match the EndOfLogTLI and EndOfLog values that we need when
ArchiveRecoveryRequested is true.

I went through the original commit 7cbee7c0a1db and the thread[1] but
didn't find any related discussion for that.

Suppose we want WAL segment 246 from TLI 1, but we don't have that
segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
the TLI 2 version, we'd need to have TLI 2 in the history of the
recovery_target_timeline. And if that is the case, then we would have
to replay through the record where the timeline changes. And if we do
that, then the discrepancy postulated by the comment cannot still
exist by the time we reach this code, because this code is only
reached after we finish WAL redo. So I'm baffled as to how this can
happen, but considering how many cases there are in this code, I sure
can't promise that it doesn't. The fact that we have few tests for any
of this doesn't help either.

I am not an expert in this area, but will try to spend some more time
on understanding and testing.

1] postgr.es/m/555DD101.7080209@iki.fi

Regards,
Amul

#138Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#137)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jul 28, 2021 at 4:37 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Jul 28, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog?

On further study, I found another problem: the way my patch set leaves
things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
will not be correctly initialized in any process other than the
startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
would just be skipped. Your 0001 seems to have the same problem. You
added Assert(AmStartupProcess()) to the inside of the if
(ArchiveRecoveryRequested) block, but that doesn't fix anything.
Outside the startup process, ArchiveRecoveryRequested will always be
false, but the point is that the associated stuff should be done if
ArchiveRecoveryRequested would have been true in the startup process.
Both of our patch sets leave things in a state where that would never
happen, which is not good. Unless I'm missing something, it seems like
maybe you didn't test your patches to verify that, when the
XLogAcceptWrites() call comes from the checkpointer, all the same
things happen that would have happened had it been called from the
startup process. That would be a really good thing to have tested
before posting your patches.

My bad, I am extremely sorry about that. I usually do test my patches,
but somehow I failed to test this change due to manually testing the
whole ASRO feature and hurrying in posting the newest version.

I will try to be more careful next time.

I was quite worried about how I could have missed that, but after
thinking about it some more, I realized that the ArchiveRecoveryRequested
handling is never skipped in the startup process and is never left for
the checkpointer process to do later. That is the reason the assert was
added there.

When ArchiveRecoveryRequested is set, the server can no longer be in
WAL prohibited mode; we implicitly change the state to WAL permitted.
Here is the snippet from the 0003 patch:

@@ -6614,13 +6629,30 @@ StartupXLOG(void)
(errmsg("starting archive recovery")));
}

- /*
- * Take ownership of the wakeup latch if we're going to sleep during
- * recovery.
- */
  if (ArchiveRecoveryRequested)
+ {
+ /*
+ * Take ownership of the wakeup latch if we're going to sleep during
+ * recovery.
+ */
  OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ /*
+ * Since archive recovery is requested, we cannot be in a wal prohibited
+ * state.
+ */
+ if (ControlFile->wal_prohibited)
+ {
+ /* No need to hold ControlFileLock yet, we aren't up far enough */
+ ControlFile->wal_prohibited = false;
+ ControlFile->time = (pg_time_t) time(NULL);
+ UpdateControlFile();
+
+ ereport(LOG,
+ (errmsg("clearing WAL prohibition because the system is in archive
recovery")));
+ }
+ }
+

As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
several TLIs stored in XLogCtl. None of them seem to be precisely the
same thing as EndLogTLI, but I am hoping that replayEndTLI is close
enough. I found out pretty quickly through testing that replayEndTLI
isn't always valid -- it ends up 0 if we don't enter recovery. That's
not really a problem, though, because we only need it to be valid if
ArchiveRecoveryRequested. The code that initializes and updates it
seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
= true will force InRecovery = true. So it looks to me like
replayEndTLI will always be initialized in the cases where we need a
value. It's not yet entirely clear to me if it has to have the same
value as EndOfLogTLI. I find this code comment quite mysterious:

/*
* EndOfLogTLI is the TLI in the filename of the XLOG segment containing
* the end-of-log. It could be different from the timeline that EndOfLog
* nominally belongs to, if there was a timeline switch in that segment,
* and we were reading the old WAL from a segment belonging to a higher
* timeline.
*/
EndOfLogTLI = xlogreader->seg.ws_tli;

The thing is, if we were reading old WAL from a segment belonging to a
higher timeline, wouldn't we have switched to that new timeline?

AFAIUC, by browsing the code, yes, we are switching to the new
timeline. Along with lastReplayedTLI, lastReplayedEndRecPtr is also
the same as the EndOfLog that we needed when ArchiveRecoveryRequested
is true.

I went through the original commit 7cbee7c0a1db and the thread[1] but
didn't find any related discussion for that.

Suppose we want WAL segment 246 from TLI 1, but we don't have that
segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
the TLI 2 version, we'd need to have TLI 2 in the history of the
recovery_target_timeline. And if that is the case, then we would have
to replay through the record where the timeline changes. And if we do
that, then the discrepancy postulated by the comment cannot still
exist by the time we reach this code, because this code is only
reached after we finish WAL redo. So I'm baffled as to how this can
happen, but considering how many cases there are in this code, I sure
can't promise that it doesn't. The fact that we have few tests for any
of this doesn't help either.

I am not an expert in this area, but will try to spend some more time
on understanding and testing.

1] postgr.es/m/555DD101.7080209@iki.fi

Regards,
Amul

#139Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#138)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jul 28, 2021 at 5:03 PM Amul Sul <sulamul@gmail.com> wrote:

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

@@ -6614,13 +6629,30 @@ StartupXLOG(void)
(errmsg("starting archive recovery")));
}

- /*
- * Take ownership of the wakeup latch if we're going to sleep during
- * recovery.
- */
if (ArchiveRecoveryRequested)
+ {
+ /*
+ * Take ownership of the wakeup latch if we're going to sleep during
+ * recovery.
+ */
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ /*
+ * Since archive recovery is requested, we cannot be in a wal prohibited
+ * state.
+ */
+ if (ControlFile->wal_prohibited)
+ {
+ /* No need to hold ControlFileLock yet, we aren't up far enough */
+ ControlFile->wal_prohibited = false;
+ ControlFile->time = (pg_time_t) time(NULL);
+ UpdateControlFile();
+

Is there some reason why we are forcing 'wal_prohibited' to off if we
are doing archive recovery? It might have already been discussed, but
I could not find it on a quick look into the thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#140Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#139)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jul 29, 2021 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 28, 2021 at 5:03 PM Amul Sul <sulamul@gmail.com> wrote:

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

@@ -6614,13 +6629,30 @@ StartupXLOG(void)
(errmsg("starting archive recovery")));
}

- /*
- * Take ownership of the wakeup latch if we're going to sleep during
- * recovery.
- */
if (ArchiveRecoveryRequested)
+ {
+ /*
+ * Take ownership of the wakeup latch if we're going to sleep during
+ * recovery.
+ */
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ /*
+ * Since archive recovery is requested, we cannot be in a wal prohibited
+ * state.
+ */
+ if (ControlFile->wal_prohibited)
+ {
+ /* No need to hold ControlFileLock yet, we aren't up far enough */
+ ControlFile->wal_prohibited = false;
+ ControlFile->time = (pg_time_t) time(NULL);
+ UpdateControlFile();
+

Is there some reason why we are forcing 'wal_prohibited' to off if we
are doing archive recovery? It might have already been discussed, but
I could not find it on a quick look into the thread.

Here is: /messages/by-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com

Regards,
Amul

#141Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#138)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.
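
As a rough sketch of that idea -- assuming a future zero-argument
XLogAcceptWrites() that pulls everything it needs from shared memory, which
is exactly the part that isn't solved yet -- the checkpointer might carry a
hypothetical helper like this and call it from its main loop:

static bool acceptWritesDone = false;

static void
MaybeAcceptWrites(void)
{
	/*
	 * Run XLogAcceptWrites() exactly once, as soon as recovery has fully
	 * ended, regardless of how the system was started.  A real version
	 * would presumably also wait until WAL writes are permitted again.
	 */
	if (!acceptWritesDone && !RecoveryInProgress())
	{
		XLogAcceptWrites();
		acceptWritesDone = true;
	}
}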

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#142Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Robert Haas (#141)
1 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

Hi,

I have been testing “ALTER SYSTEM READ ONLY” and have written a few TAP test
cases for this feature.
Please find the test cases (draft version) attached, to be applied on top of
Amul's v30 patch.
Kindly review them and let me know what changes are required.
--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

Attachments:

prohibitwal-tap-test.patchapplication/octet-stream; name=prohibitwal-tap-test.patchDownload
diff --git a/src/test/prohibit_wal/Makefile b/src/test/prohibit_wal/Makefile
new file mode 100644
index 0000000..51767ff
--- /dev/null
+++ b/src/test/prohibit_wal/Makefile
@@ -0,0 +1,24 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/prohibit_wal
+#
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/prohibit_wal/Makefile
+#
+#-------------------------------------------------------------------------
+
+
+subdir = src/test/prohibit_wal
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/prohibit_wal/t/001-pg_prohibit_wal.pl b/src/test/prohibit_wal/t/001-pg_prohibit_wal.pl
new file mode 100644
index 0000000..18cb721
--- /dev/null
+++ b/src/test/prohibit_wal/t/001-pg_prohibit_wal.pl
@@ -0,0 +1,234 @@
+use strict;
+use warnings;
+use Cwd;
+use Config;
+use File::Path qw(rmtree);
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+my $psql_timeout = IPC::Run::timer(60);
+my $primary = get_new_node('primary');
+
+# initialize the cluster
+$primary->init(
+        allows_streaming => 1,
+        auth_extra       => [ '--create-role', 'repl_role' ]);
+$primary->append_conf('postgresql.conf', "max_replication_slots = 2");
+$primary->start;
+
+# Below testcase verify the syntax and corresponding wal_prohibited values.
+$primary->command_ok([ 'psql', '-c', "SELECT pg_prohibit_wal(TRUE);" ],
+                    "pg_prohibit_wal(TRUE) syntax verified.");
+my $data_check = $primary->safe_psql('postgres', 'SHOW wal_prohibited;');
+is($data_check, 'on', "wal_prohibited is 'on' with pg_prohibit_wal(TRUE) verified");
+
+$primary->command_ok([ 'psql', '-c', "SELECT pg_prohibit_wal(FALSE);" ],
+                    "pg_prohibit_wal(FALSE) syntax verified.");
+my $data_check = $primary->safe_psql('postgres', 'SHOW wal_prohibited;');
+is($data_check, 'off', "wal_prohibited is 'off' with pg_prohibit_wal(FALSE) verified");
+
+# Below testcase verify an user can execute CREATE/INSERT statement
+# with pg_prohibit_wal(false).
+$primary->safe_psql('postgres',  <<EOM);
+SELECT pg_prohibit_wal(false);
+CREATE TABLE test_tbl (id int);
+INSERT INTO test_tbl values(10);
+EOM
+my $data_check = $primary->safe_psql('postgres', 'SELECT count(id) FROM test_tbl;');
+is($data_check, '1', 'CREATE/INSERT statement is working fine with pg_prohibit_wal(false)');
+
+# Below testcase verify an user can not execute CREATE TABLE command
+# in a read-only transaction.
+$primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true);');
+my ($stdout, $stderr, $timed_out);
+$primary->psql('postgres', 'create table test_tbl2 (id int);',
+          stdout => \$stdout, stderr => \$stderr);
+is($stderr, "psql:<stdin>:1: ERROR:  cannot execute CREATE TABLE in a read-only transaction",
+                                  "cannot execute CREATE TABLE in a read-only transaction");
+
+# Below testcase verify if permission denied for function pg_prohibit_wal(true/false).
+$primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false);');
+$primary->safe_psql('postgres', 'CREATE USER non_superuser;');
+$primary->psql('postgres', 'SELECT pg_prohibit_wal(true);',
+          stdout => \$stdout, stderr => \$stderr,
+          extra_params => [ '-U', 'non_superuser' ]);
+is($stderr, "psql:<stdin>:1: ERROR:  permission denied for function pg_prohibit_wal",
+                                    "permission denied for function pg_prohibit_wal");
+
+# Below testcase verify the GRANT/REVOKE permission on function
+# pg_prohibit_wal TO/FROM non_superuser.
+# if non_superuser can execute pg_prohibit_wal function only after getting execute permission
+$primary->command_ok([ 'psql', '-c', "GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser;" ],
+                    "Grant permission on function pg_prohibit_wal to non_superuser verified");
+
+$primary->command_ok([ 'psql', '-U', 'non_superuser', '-d', 'postgres', '-c', "SELECT pg_prohibit_wal(true);" ],
+                    "Non_superuser can execute pg_prohibit_wal(true) after getting execute permission");
+
+# Below testcase verify 'WAL write prohibited' from the 'pg_controldata'
+# when "pg_prohibit_wal(true)" and START/STOP/RESTART the server.
+$primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true);');
+my $primary_data = $primary->data_dir;
+my ($res, $controldataparm) = ('','');
+
+$res = get_controldata_parm('WAL write prohibited');
+is( $res, 'WAL write prohibited:                 yes', "Verified, 'WAL write prohibited: yes' after pg_prohibit_wal(TRUE).");
+
+$primary->stop;
+$primary->start;
+$res = get_controldata_parm('WAL write prohibited');
+is( $res, 'WAL write prohibited:                 yes', "Verified, 'WAL write prohibited: yes' after pg_prohibit_wal(TRUE) and STOP/START Server.");
+
+$primary->restart;
+$res = get_controldata_parm('WAL write prohibited');
+is( $res, 'WAL write prohibited:                 yes', "Verified, 'WAL write prohibited: yes' after pg_prohibit_wal(TRUE) and RESTART Server.");
+
+# Below testcase verify 'WAL write prohibited' from the 'pg_controldata with adding STANDBY.SIGNAL'
+# when "pg_prohibit_wal(true)" and RESTART the server.
+system("touch $primary_data/standby.signal");
+$primary->restart;
+$res = get_controldata_parm('WAL write prohibited');
+is( $res, 'WAL write prohibited:                 no', "Verified, 'WAL write prohibited: no' after pg_prohibit_wal(TRUE) with adding STANDBY.SIGNAL and RESTART Server.");
+
+system("rm -rf $primary_data/standby.signal");
+$primary->restart;
+
+$primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false);');
+
+# Below test will verify, open a transaction in one session, which perform write
+# operation (e.g create/insert)  and from another session execute pg_prohibit_wal(true)
+# this should kill transaction.
+
+# Session 1: Execute some transactions.
+my ($session1_stdin, $session1_stdout, $session1_stderr) = ('', '', '');
+my $session1 = IPC::Run::start(
+        [
+                'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+                $primary->connstr('postgres')
+        ],
+        '<',
+        \$session1_stdin,
+        '>',
+        \$session1_stdout,
+        '2>',
+        \$session1_stderr,
+    $psql_timeout);
+
+$session1_stdin .= q[
+BEGIN;
+INSERT INTO test_tbl (SELECT generate_series(1,1000,2));
+];
+
+$session1->run();
+# Session 2: Execute pg_prohibit_wal(TRUE)
+$primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(TRUE);');
+
+
+ok( pump_until(
+                $session1,
+                \$session1_stderr,
+                qr/psql:<stdin>:3: ERROR:  cannot execute INSERT in a read-only transaction/m
+        ),
+        "Verified, cannot execute INSERT in a read-only transaction");
+
+$session1->finish;
+
+sub pump_until
+{
+        my ($proc, $stream, $untl) = @_;
+        $proc->pump_nb();
+        while (1)
+        {
+                last if $$stream =~ /$untl/;
+                if ($psql_timeout->is_expired)
+                {
+                        diag("aborting wait: program timed out");
+                        diag("stream contents: >>", $$stream, "<<");
+                        diag("pattern searched for: ", $untl);
+
+                        return 0;
+                }
+                if (not $proc->pumpable())
+                {
+                        diag("aborting wait: program died");
+                        diag("stream contents: >>", $$stream, "<<");
+                        diag("pattern searched for: ", $untl);
+
+                        return 0;
+                }
+                $proc->pump();
+        }
+        return 1;
+}
+
+sub get_controldata_parm
+{
+  my ($param_name) = @_;
+  my ($stdout, $stderr) = run_command([ 'pg_controldata', $primary_data ]);
+  my @control_data = split("\n", $stdout);
+  foreach (@control_data)
+  {
+    if (index($_, $param_name) != -1)
+    {
+        diag("pg_controldata content: >>", $_, "\n");
+        $controldataparm = $_;
+    }
+  }
+  return $controldataparm;
+}
+
+$primary->restart;
+
+# Below testcase verify an user can execute CREATE TABLE AS SELECT statement
+# with pg_prohibit_wal(false).
+
+$primary->safe_psql('postgres', "SELECT pg_prohibit_wal(FALSE);" );
+$primary->safe_psql('postgres', "CREATE TABLE test_tbl2 AS SELECT generate_series(1,12) AS a");
+
+my $data_check = $primary->safe_psql('postgres', 'SELECT count(a) FROM test_tbl2;');
+is($data_check, '12', 'CREATE TABLE statement is working fine with pg_prohibit_wal(false) on master');
+
+my $backup_name = 'my_backup';
+
+# Take backup
+$primary->backup($backup_name);
+
+# Create streaming standby
+# standby_1 -> primary
+# standby_2 -> standby_1
+my $node_standby_1 = get_new_node('standby_1');
+$node_standby_1->init_from_backup($primary, $backup_name,
+        has_streaming => 1);
+$node_standby_1->start;
+$node_standby_1->backup($backup_name);
+
+my $node_standby_2 = get_new_node('standby_2');
+$node_standby_2->init_from_backup($node_standby_1, $backup_name,
+        has_streaming => 1);
+$node_standby_2->start;
+$node_standby_2->backup($backup_name);
+
+# Below testcase verify MASTER/SLAVE  replication working fine and validate data
+my $data_check = $primary->safe_psql('postgres', 'select pg_is_in_recovery();');
+is($data_check, 'f', 'Master: Streaming Replication is working fine.');
+
+my $data_check = $node_standby_1->safe_psql('postgres', 'select pg_is_in_recovery();');
+is($data_check, 't', 'Slave: Streaming Replication is working fine.');
+
+my $data_check = $node_standby_2->safe_psql('postgres', 'select pg_is_in_recovery();');
+is($data_check, 't', 'Slave: Streaming Replication is working fine.');
+
+my $data_check = $node_standby_1->safe_psql('postgres', 'SELECT count(a) FROM test_tbl2;');
+is($data_check, '12', 'Check the streamed content on node_standby_1');
+
+# Below testcase verify an user in standby "cannot execute pg_prohibit_wal() during recovery"
+my ($stdout, $stderr, $timed_out);
+$node_standby_1->psql('postgres', 'SELECT pg_prohibit_wal(TRUE);',
+          stdout => \$stdout, stderr => \$stderr);
+is($stderr, "psql:<stdin>:1: ERROR:  cannot execute pg_prohibit_wal() during recovery",
+                                    "cannot execute pg_prohibit_wal() during recovery");
+
+# stop server
+$node_standby_1->stop;
+$node_standby_2->stop;
+$primary->stop;
#143Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#141)
7 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is the rebased version on top of the latest master head; it
includes the refactoring patches posted by Robert.

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.

Unfortunately, I didn't get much time to think about this and don't
have a strong opinion on it either.

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

Yes, my next plan is to work on the TAP tests and look into the patch
posted by Prabhat to improve test coverage.

Regards,
Amul Sul

Attachments:

v31-0007-Documentation.patchapplication/octet-stream; name=v31-0007-Documentation.patchDownload
From ae6b9d8b7e58bc4a164ccaa2edececaa49e90fb0 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v31 7/8] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbeb..0ea08ee91ac 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25224,9 +25224,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25343,6 +25343,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        This function accepts a boolean argument to alter the WAL read-write
+        state and forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system changes to the WAL
+        prohibited state, in which WAL writes are restricted, if not already.
+        When <literal>false</literal> is passed, the system changes to the WAL
+        permitted state, in which WAL writes are allowed, if not already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 22af7dbf51b..89da3eb2e94 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a WAL prohibited mode, in which inserting write-ahead log is prohibited
+    until the same function is executed to change the state back to read-write.
+    As in Hot Standby, connections to the server are allowed to run read-only
+    queries in the WAL prohibited state. If the system is in the WAL prohibited
+    state, the GUC <literal>wal_prohibited</literal> will show
+    <literal>on</literal>; otherwise, it will show <literal>off</literal>.
+    When the user requests the WAL prohibited state, any existing session that
+    is already running a transaction which has performed, or is planning to
+    perform, WAL write operations will be terminated. This is useful for an HA
+    setup where the master server needs to stop accepting WAL writes
+    immediately and kick out any transaction expecting to write WAL at the
+    end, in case the network goes down on the master or replication
+    connections fail.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint, so at
+    restart it goes into crash recovery and stays in the WAL prohibited state
+    until the system is changed back to read-write.  If, while starting a WAL
+    prohibited server, a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file is found, the system implicitly
+    leaves the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the state in which it is not currently possible to insert write-ahead
+log records, either because the system is still in recovery or because it was
+forced into the WAL prohibited state by the pg_prohibit_wal() function.  We
+have a lower-level defense in XLogBeginInsert() and elsewhere to stop us from
+modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error; otherwise, it will cause a PANIC as mentioned previously.
+
+We never reach the point where we try to write WAL during recovery, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend which receives the WAL prohibited state transition barrier
+interrupt needs to stop writing WAL immediately.  While absorbing the barrier,
+a backend will kill its running transaction if it has a valid XID, since a
+valid XID indicates that the transaction has performed, or plans to perform,
+WAL writes.  A transaction that has not yet acquired an XID, or an operation
+such as VACUUM or CREATE INDEX CONCURRENTLY that does not necessarily have a
+valid XID when writing WAL, is not stopped by barrier processing and might hit
+an error from XLogBeginInsert() while trying to write WAL in the WAL
+prohibited state.  To prevent such an error inside a critical section, WAL
+write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assert-only flag that indicates
+whether permission has been checked before calling XLogBeginInsert().  If it
+has not, XLogBeginInsert() will fail an assertion.  The WAL permission check
+is not mandatory if XLogBeginInsert() is not inside a critical section, where
+throwing an error is acceptable.  To get the permission check flag set, either
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag automatically resets
+on exit from the critical section.  The rules for choosing among the
+permission check routines are:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. vacuum) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked with AssertWALPermittedHaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and that
+	may or may not have an XID, but where assert-enabled builds should still
+	verify that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while in a read-only state (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints while in such a state.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set in these states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v31-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchapplication/octet-stream; name=v31-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchDownload
From 97abdff64d391edb37feda54e1bd368ed1318de4 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v31 6/8] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before WAL writes when the system is in the WAL
prohibited state, based on the following criteria:

 - Add an ERROR for functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Add an Assert for functions that cannot be reached without a valid XID; the
   Assert also ensures XID validity.  For that, AssertWALPermittedHaveXID() is
   added.

To enforce the rule that the aforesaid assert or error check is made before
entering a critical section for a WAL write, a new assert-only flag
walprohibit_checked_state is added.  If the check is missing,
XLogBeginInsert() will fail an assertion when called inside a critical
section.

If the WAL insert is not done inside a critical section, the above check is
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
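
As a concrete illustration of the rule above, a caller that writes WAL is
expected to perform the permission check just before entering its critical
section.  The sketch below is schematic only and is not code from the patch:
CheckWALPermitted(), AssertWALPermittedHaveXID(), RelationNeedsWAL() and
START_CRIT_SECTION() are the real helpers used throughout the diff, while the
wrapper function and its body are placeholders.

	static void
	example_wal_write(Relation rel)
	{
		bool		needwal = RelationNeedsWAL(rel);

		/* Reachable without a valid XID (e.g. VACUUM): use the ERROR form. */
		if (needwal)
			CheckWALPermitted();

		START_CRIT_SECTION();

		/* ... scribble on shared buffers, MarkBufferDirty(), etc. ... */

		if (needwal)
		{
			/* XLogBeginInsert() asserts that the check above was done */
			XLogBeginInsert();
			/* XLogRegisterBuffer()/XLogRegisterData(), then XLogInsert() */
		}

		END_CRIT_SECTION();
	}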
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 +++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 +++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 14 ++++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 47 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 +++++++++++++
 39 files changed, 501 insertions(+), 67 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..8c672770e79 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 404f2b62212..3b920c76936 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index b7300253566..8ddae82b57d 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2433998f39b..2945ea4b6ba 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..0cb9adf8b5d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write if WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2c04b69221f..906e586dd2d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1345,6 +1346,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1360,8 +1366,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1967,8 +1972,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1993,7 +2003,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2425,6 +2435,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2435,6 +2446,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2465,7 +2479,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..b532b522275 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process never has the WAL prohibit state, so
+	 * skip the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +486,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +500,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +534,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 271994b08df..99466b5a5a9 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6d3efb49a40..ee0dd764665 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2208,6 +2211,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2306,6 +2312,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index d7f8ffaa09c..9c77175090f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for the WAL writes.  For this, either of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6141a3d0425..72c14cc3e9f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We will reach here only with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 37296c129f1..91d49a99192 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1056,7 +1056,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2907,9 +2907,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL
+	 * prohibit state should not restrict WAL flushing; otherwise, dirty
+	 * buffers could not be evicted until WAL had been flushed to their LSNs.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9419,6 +9421,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9584,6 +9589,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10245,7 +10252,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10259,10 +10266,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10284,8 +10291,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index e596a0470a9..6a5d6561439 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -138,9 +139,15 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -222,6 +229,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index f362b29a88a..c2ee6400c40 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -932,6 +932,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* Checkpoints are allowed during recovery, but not in the WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 33d99f604ad..c0f3037b24d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3892,13 +3892,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ff77a68552c..a4245aabe5c 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -56,4 +57,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off when pg_prohibit_wal() is executed,
+ * so any part of the code that can only be reached with an XID assigned is
+ * never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the assertion above, a transaction that doesn't have a valid
+ * XID (e.g. VACUUM) won't be killed while the system state is being changed to
+ * WAL prohibited.  Therefore, we need to error out explicitly before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 68d840d6996..a14b1f4559f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset the walprohibit_checked_state flag when no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
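
To restate the coding rule these hunks enforce: any code path that is about to
write WAL must check permission before it enters the critical section, because
once inside, an ERROR would be promoted to a PANIC.  Below is a minimal sketch
of a hypothetical call site following the rule; the function name, the payload,
and the use of XLOG_NOOP are illustrative only and are not part of the patch.

/* Hypothetical call site following the new coding rule (illustration only). */
#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "catalog/pg_control.h"
#include "miscadmin.h"

static void
log_dummy_change(char *payload, int len)
{
	XLogRecPtr	recptr;

	/*
	 * No XID is held here, so error out (rather than Assert) if WAL is
	 * prohibited -- this must happen before the critical section starts.
	 */
	CheckWALPermitted();

	START_CRIT_SECTION();

	XLogBeginInsert();
	XLogRegisterData(payload, len);

	/* XLOG_NOOP is an existing xlog rmgr record type, used here as a stand-in */
	recptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);

	END_CRIT_SECTION();

	XLogFlush(recptr);
}

A call site that is guaranteed to hold an XID would use
AssertWALPermittedHaveXID() instead, as the two-phase and commit/abort paths in
the hunks above do.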

Attachment: v31-0005-Implement-wal-prohibit-state-using-global-barrie.patch (application/octet-stream)
From 1d614bf702aeee09866c7ed65006af5e509eff1c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v31 5/8] Implement wal prohibit state using global barrier.

Implementation:

 1. A user changes the server state to WAL-prohibited by calling the
    pg_prohibit_wal(true) SQL function.  The backend marks the state
    transition as in progress in shared memory and signals the checkpointer
    process.  The checkpointer, noticing the in-progress transition, emits
    the barrier request and, once the transition has completed, acknowledges
    back to the backend that requested the state change.  The final state is
    also written to the control file to make it persistent across restarts.
    (A toy model of the shared state counter follows this list.)

 2. When a backend receives the WAL-prohibit barrier, if it is already in a
    transaction that has been assigned an XID, the backend is killed by
    throwing FATAL.  (XXX: needs more discussion.)

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special is needed right now; it simply calls
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which reflects the WAL prohibited state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-prohibited server state until someone wakes them up, e.g. a
    backend later requesting that the system be put back into a state where
    WAL is no longer prohibited.

 6. At shutdown in WAL-prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again performs crash recovery, but the
    end-of-recovery checkpoint and the WAL writes necessary to start the
    server normally are skipped; they are performed when the system is
    changed back so that WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is not allowed on a standby server.

 8. The presence of recovery.signal and/or standby.signal will implicitly
    pull the server out of the WAL prohibited state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on"
    when the system is WAL prohibited.
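
To make the counter scheme in the diff below easier to follow, here is a
small, self-contained model of the shared wal_prohibit_counter: the low two
bits encode the state, odd counter values are the in-progress "GOING" states,
and each request bumps the counter once to start a transition while the
checkpointer bumps it again to finish it.  This is only a toy model for
illustration; the real shared counter lives in walprohibit.c below and is
advanced with pg_atomic_add_fetch_u32().

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the four states encoded in the low two bits of the counter. */
typedef enum
{
	READ_WRITE = 0,
	GOING_READ_ONLY = 1,
	READ_ONLY = 2,
	GOING_READ_WRITE = 3
} WALProhibitStateModel;

static WALProhibitStateModel
state_of(uint32_t counter)
{
	return (WALProhibitStateModel) (counter & 3);
}

int
main(void)
{
	uint32_t	counter = 0;	/* brand new cluster: READ_WRITE */

	assert(state_of(counter) == READ_WRITE);

	/* pg_prohibit_wal(true): the requester starts the transition ... */
	counter++;
	assert(state_of(counter) == GOING_READ_ONLY);

	/* ... and the checkpointer finishes it after the barrier is absorbed. */
	counter++;
	assert(state_of(counter) == READ_ONLY);

	/* pg_prohibit_wal(false) goes through the opposite transition. */
	counter++;
	assert(state_of(counter) == GOING_READ_WRITE);
	counter++;
	assert(state_of(counter) == READ_WRITE);

	printf("counter=%u state=%d\n", counter, (int) state_of(counter));
	return 0;
}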
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 477 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 201 +++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  59 +++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 877 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..d7f8ffaa09c
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,477 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take an appropriate action when system wide WAL prohibit
+ *	state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning toward the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing only the transaction by throwing ERROR, for the following
+		 * reasons that still need more thought:
+		 *
+		 * 1. Due to some challenges with the wire protocol, we cannot simply
+		 * kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then ERROR will kill only
+		 * the current subtransaction.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * subtransaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Other than read-write state will be considered as read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete WAL prohibit state transition.
+ *
+ *	Depending on the final state being transitioned to, the in-memory state
+ *	update is done either before or after emitting the global barrier.
+ *
+ *	The idea is that when we declare the system WAL prohibited, WAL writes
+ *	must already be prohibited in every backend; but when the system is no
+ *	longer WAL prohibited, it is not necessary to take all backends out of
+ *	the WAL prohibited state at once.  There is no harm in letting those
+ *	backends keep running read-only a little longer until we emit the
+ *	barrier, since they might have connected while the system was WAL
+ *	prohibited and be doing read-only work.  Backends that connect from now
+ *	on can start read-write operations immediately.
+ *
+ *	Therefore, when moving the system out of the WAL prohibited state, we
+ *	update the shared state immediately and emit the barrier afterwards.
+ *	But when moving the system into the WAL prohibited state, we emit the
+ *	global barrier first, to ensure that no backend writes WAL before we set
+ *	the shared state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* When going out of the WAL prohibited state, update the state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow wal writes should be done by now  */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * A WAL prohibit state change has been initiated.  We need to complete the
+	 * transition by establishing the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by every backend, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have been changed by anyone
+		 * else by now.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * The state change must be performed by the checkpointer process, which
+	 * has to process all pending WAL prohibit state change requests as soon
+	 * as possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes
+	 * run in non-checkpointer processes, do nothing if we are not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, then
+				 * the WAL writes that the startup process normally performs
+				 * to bring the server up have been skipped; if that is the
+				 * case, do them right away.  While doing that, hold off state
+				 * transitions to avoid a recursive attempt to process the WAL
+				 * prohibit state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+				{
+					HoldWALProhibitStateTransition = true;
+					PerformPendingXLogAcceptWrites();
+					HoldWALProhibitStateTransition = false;
+				}
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later ask us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
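
As a usage illustration (not part of the patch), an external HA or switchover
agent could flip the state through the new SQL-callable function over libpq
along the lines below.  The connection string and error handling are
assumptions for the sketch, and the connecting role would need EXECUTE on
pg_prohibit_wal(), which the patch revokes from PUBLIC.

/* Hedged sketch: calling pg_prohibit_wal() from a C client via libpq. */
#include <stdio.h>
#include <stdlib.h>

#include <libpq-fe.h>

int
main(void)
{
	/* Connection parameters are illustrative only. */
	PGconn	   *conn = PQconnectdb("host=primary dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return EXIT_FAILURE;
	}

	/* Ask the server to stop accepting WAL writes; waits for the transition. */
	res = PQexec(conn, "SELECT pg_prohibit_wal(true)");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "pg_prohibit_wal failed: %s", PQerrorMessage(conn));
	PQclear(res);

	/* The wal_prohibited GUC added by this patch should now report "on". */
	res = PQexec(conn, "SHOW wal_prohibited");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("wal_prohibited = %s\n", PQgetvalue(res, 0, 0));
	PQclear(res);

	PQfinish(conn);
	return EXIT_SUCCESS;
}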
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 387f80419a5..6141a3d0425 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1f9128ec6fc..37296c129f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -231,9 +232,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -515,6 +517,9 @@ typedef enum ExclusiveBackupState
  * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
  * end-of-recovery record but don't need to checkpoint.
  *
+ * RECOVERY_XLOG_WRITE_CHECKPOINT means we need to write a checkpoint.
+ * This is only valid when the checkpointer itself wants a checkpoint.
+ *
  * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need a request that the
  * checkpointer perform a checkpoint. This is only valid when the
  * checkpointer is running.
@@ -523,6 +528,7 @@ typedef enum
 {
 	RECOVERY_XLOG_NOTHING,
 	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_WRITE_CHECKPOINT,
 	RECOVERY_XLOG_REQUEST_CHECKPOINT
 } RecoveryXlogAction;
 
@@ -743,6 +749,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState tracks whether the end-of-recovery checkpoint
+	 * and the WAL writes required to start the server normally were performed.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -911,6 +923,7 @@ static bool recoveryApplyDelay(XLogReaderState *record);
 static void SetLatestXTime(TimestampTz xtime);
 static void SetCurrentChunkStartTime(TimestampTz xtime);
 static void CheckRequiredParameterValues(void);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static void XLogReportParameters(void);
 static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 								TimeLineID prevTLI);
@@ -4991,6 +5004,17 @@ SetControlFileDBState(DBState state)
 	LWLockRelease(ControlFileLock);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5267,6 +5291,7 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6416,6 +6441,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch the latest XLOG accept-writes state.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6781,13 +6815,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8050,8 +8101,29 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Prepare to accept WAL writes. */
-	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts should be
+	 * allowed or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8103,13 +8175,32 @@ XLogAcceptWrites(RecoveryXlogAction xlogaction,
 				 TimeLineID EndOfLogTLI,
 				 XLogRecPtr EndOfLog)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	XLogCtlInsert *Insert;
+
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsUnderPostmaster);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, then only the checkpointer
+	 * process should be here, completing an operation that was skipped earlier
+	 * while booting the system in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
+	Insert = &XLogCtl->Insert;
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
@@ -8134,6 +8225,40 @@ XLogAcceptWrites(RecoveryXlogAction xlogaction,
 	 * commit timestamp.
 	 */
 	CompleteCommitTsInitialization();
+
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+}
+
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * The EndOfLogTLI and EndOfLog inputs to XLogAcceptWrites() are required
+	 * only when archive recovery was requested, and we never get here in that
+	 * case: if archive recovery is requested, the system is taken out of the
+	 * WAL prohibited state and the XLogAcceptWrites() operation is never
+	 * skipped.
+	 */
+	XLogAcceptWrites(DetermineRecoveryXlogAction(), 0, InvalidXLogRecPtr);
+
+	/*
+	 * We need to update DBState explicitly, as the startup process would,
+	 * because the end-of-recovery checkpoint sets the DB state to
+	 * shutdown.
+	 */
+	SetControlFileDBState(DB_IN_PRODUCTION);
 }
 
 /*
@@ -8251,6 +8376,12 @@ DetermineRecoveryXlogAction(void)
 		LocalPromoteIsTriggered)
 		return RECOVERY_XLOG_WRITE_END_OF_RECOVERY;
 
+	/*
+	 * Write a checkpoint straight away if this is the checkpointer process.
+	 */
+	if (AmCheckpointerProcess())
+		return RECOVERY_XLOG_WRITE_CHECKPOINT;
+
 	/*
 	 * We decided against writing only an end-of-recovery record, and we know
 	 * that the postmaster was told to launch the checkpointer, so just
@@ -8283,6 +8414,11 @@ PerformRecoveryXLogAction(RecoveryXlogAction action)
 			CreateEndOfRecoveryRecord();
 			break;
 
+		case RECOVERY_XLOG_WRITE_CHECKPOINT:
+			/* Full checkpoint, when checkpointer calling this. */
+			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+			break;
+
 		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
 			/* Full checkpoint, when checkpointer is running. */
 			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
@@ -8409,9 +8545,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8430,9 +8566,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8454,6 +8601,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8743,9 +8896,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint during recovery.  A shutdown checkpoint and xlog
+	 * rotation are performed only when WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8758,6 +8915,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -9007,8 +9167,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
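
The LocalXLogInsertAllowed handling above is easy to misread, so here is a
small standalone model of the tri-state local cache as modified by this patch:
-1 means "check the shared state", 0 means "cached: not allowed", and 1 means
"cached: allowed".  The recovery_in_progress and wal_prohibited flags stand in
for RecoveryInProgress() and IsWALProhibited(); the model is illustrative only.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Inputs standing in for the real shared state. */
static bool recovery_in_progress = false;
static bool wal_prohibited = false;

/* 1: allowed, 0: not allowed, -1: must check the shared state. */
static int	local_xlog_insert_allowed = -1;

static bool
xlog_insert_allowed(void)
{
	/* If the value is cached, return it without rechecking. */
	if (local_xlog_insert_allowed >= 0)
		return (bool) local_xlog_insert_allowed;

	if (recovery_in_progress)
		return false;			/* not cached; keep rechecking */

	if (wal_prohibited)
	{
		/* Cache "unconditionally false" until explicitly reset. */
		local_xlog_insert_allowed = 0;
		return false;
	}

	/* Out of recovery and not prohibited: cache "unconditionally true". */
	local_xlog_insert_allowed = 1;
	return true;
}

/* Equivalent of ResetLocalXLogInsertAllowed() in the patch. */
static void
reset_local_xlog_insert_allowed(void)
{
	local_xlog_insert_allowed = -1;
}

int
main(void)
{
	assert(xlog_insert_allowed());	/* normal running: allowed and cached */

	/* pg_prohibit_wal(true): the barrier resets the local cache ... */
	wal_prohibited = true;
	reset_local_xlog_insert_allowed();
	assert(!xlog_insert_allowed()); /* ... and the next check caches "no" */

	/* pg_prohibit_wal(false): the barrier resets it again. */
	wal_prohibited = false;
	reset_local_xlog_insert_allowed();
	assert(xlog_insert_allowed());

	printf("model ok\n");
	return 0;
}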
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a416e94d371..0934478188e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -699,6 +699,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 912ef9cb54c..19d10147314 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -701,10 +701,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bc9ac7ccfaf..f362b29a88a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -348,6 +349,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -696,6 +698,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1343,3 +1348,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up the WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 364654e1060..c5d8edd82bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 4a2ed414b00..06f8c9569f0 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to stop wal prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold off WAL prohibit state change requests for a
+		 * long time when there are many fsync requests to be processed.  They
+		 * need to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * For the same reason mentioned previously for the wal prohibit
+				 * state change request check.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 27fbf1f3aae..083555aedfe 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ef7e6bfb779..85a12fc580a 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a2e0f8de7e7..5044285c2b8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -234,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -674,6 +676,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2116,6 +2119,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether WAL writes are prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12536,4 +12551,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ff77a68552c
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,59 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
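
(Not part of the patch itself, just to illustrate the counter encoding
described in the comment above: a tiny standalone C program along the
following lines shows how successive counter values map onto the four
states. The program, its array of state names, and the helper name
state_of() are ours for illustration only; the enum values and the
two-bit extraction mirror what walprohibit.h declares.)

/*
 * Illustration only -- not code from the patch.  Mirrors the enum and
 * GetWALProhibitState() from walprohibit.h to show how the shared-memory
 * counter, which only ever advances by 1, cycles through the four states.
 */
#include <stdio.h>
#include <stdint.h>

typedef enum
{
	WALPROHIBIT_STATE_READ_WRITE = 0,
	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
	WALPROHIBIT_STATE_READ_ONLY = 2,
	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
} WALProhibitState;

static const char *const state_names[] = {
	"READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
};

/* Same two-bit extraction as GetWALProhibitState() in walprohibit.h */
static WALProhibitState
state_of(uint32_t wal_prohibit_counter)
{
	return (WALProhibitState) (wal_prohibit_counter & 3);
}

int
main(void)
{
	/* A cluster started read-write begins at 0; one started read-only at 2. */
	for (uint32_t counter = 0; counter <= 6; counter++)
		printf("counter %u -> %s\n", (unsigned) counter,
			   state_names[state_of(counter)]);
	return 0;
}

Running it shows the counter cycling READ_WRITE -> GOING_READ_ONLY ->
READ_ONLY -> GOING_READ_WRITE -> READ_WRITE and so on, which is why a
single increment of the shared counter is always enough to advance the
state machine.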
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4f8b3e31ab7..b16e2682e75 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -134,6 +134,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of work that enables wal writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -282,6 +290,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -290,6 +299,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -301,7 +311,9 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void SetControlFileDBState(DBState state);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL insertion is currently prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b603700ed9d..64c9020a77b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11645,6 +11645,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
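
(Also not part of the patch: for readers who have not used the ProcSignal
barrier machinery that this new enum member plugs into, the existing API
in storage/procsignal.h provides EmitProcSignalBarrier() and
WaitForProcSignalBarrier(). A minimal sketch of how an initiator could
raise the new barrier and wait for it is below; the helper name is
hypothetical and the patch's actual code is not shown here.)

/*
 * Illustration only -- not code from the patch.  Generic shape of emitting
 * a barrier of the new type and waiting until every backend attached to
 * ProcSignal has absorbed it.
 */
#include "postgres.h"

#include "storage/procsignal.h"

static void
WaitForWALProhibitBarrier(void)		/* hypothetical helper */
{
	uint64		generation;

	/* Ask all ProcSignal participants to run their barrier processing. */
	generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);

	/* Block until each of them reports the barrier as absorbed. */
	WaitForProcSignalBarrier(generation);
}

On the receiving side, absorbing a barrier of this type presumably ends up
in ProcessBarrierWALProhibit(), declared in walprohibit.h above.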
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6007827b445..43e826ceeb3 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -225,7 +225,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 37cf4b2f76b..107fccc6e8a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2824,6 +2824,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v31-0003-Create-XLogAcceptWrites-function-with-code-from-.patch (application/octet-stream)
From c38164fb91621bd86809513d3a83d415080480db Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 15:37:53 -0400
Subject: [PATCH v31 3/8] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.
---
 src/backend/access/transam/xlog.c | 75 +++++++++++++++++++------------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cf77faa98cc..bcf6a7e234d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -948,6 +948,9 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static void XLogAcceptWrites(RecoveryXlogAction xlogaction,
+							 TimeLineID EndOfLogTLI,
+							 XLogRecPtr EndOfLog);
 static RecoveryXlogAction DetermineRecoveryXlogAction(void);
 static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
@@ -8035,35 +8038,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8107,6 +8083,47 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static void
+XLogAcceptWrites(RecoveryXlogAction xlogaction,
+				 TimeLineID EndOfLogTLI,
+				 XLogRecPtr EndOfLog)
+{
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

v31-0004-Refactor-add-function-to-set-database-state-in-c.patch (application/octet-stream)
From 22da9cf17731070b927eebce46b6a967d15e3019 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 16 Jun 2021 09:02:24 -0400
Subject: [PATCH v31 4/8] Refactor: add function to set database state in
 control file

====
TODO:
====
 - The same code updating the database state exists in StartupXLOG(),
   but it is not clear whether we need to optimize that, since that
   code updates SharedRecoveryState while holding ControlFileLock.
---
 src/backend/access/transam/xlog.c | 31 ++++++++++++++++---------------
 src/include/access/xlog.h         |  2 ++
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bcf6a7e234d..1f9128ec6fc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -38,7 +38,6 @@
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
-#include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
@@ -4979,6 +4978,19 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/*
+ * Set ControlFile's database state
+ */
+void
+SetControlFileDBState(DBState state)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = state;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -9024,13 +9036,7 @@ CreateCheckPoint(int flags)
 	START_CRIT_SECTION();
 
 	if (shutdown)
-	{
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
-		ControlFile->time = (pg_time_t) time(NULL);
-		UpdateControlFile();
-		LWLockRelease(ControlFileLock);
-	}
+		SetControlFileDBState(DB_SHUTDOWNING);
 
 	/*
 	 * Let smgr prepare for checkpoint; this has to happen before we determine
@@ -9579,13 +9585,8 @@ CreateRestartPoint(int flags)
 
 		UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
-		{
-			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-			ControlFile->time = (pg_time_t) time(NULL);
-			UpdateControlFile();
-			LWLockRelease(ControlFileLock);
-		}
+			SetControlFileDBState(DB_SHUTDOWNED_IN_RECOVERY);
+
 		return false;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a8ede700de..4f8b3e31ab7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -15,6 +15,7 @@
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -300,6 +301,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileDBState(DBState state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
-- 
2.18.0

v31-0002-Postpone-some-end-of-recovery-operations-relatin.patch (application/octet-stream)
From fb68c0b404bb1f1f22f8bca5153d6d880ea666fc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH v31 2/8] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, we issued XLOG_FPW_CHANGE and either
XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY while still
technically in recovery, and also performed post-archive-recovery
cleanup steps at that point. Postpone that stuff until after we clear
InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.
---
 src/backend/access/transam/xlog.c | 34 ++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6382b5cc932..cf77faa98cc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7978,22 +7978,11 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Figure out what xlog activity is needed to mark end of recovery. We
+	 * must make this determination before setting InRecovery = false, or
+	 * we'll get the wrong answer.
 	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	xlogaction = DetermineRecoveryXlogAction();
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8046,6 +8035,23 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

v31-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch (application/octet-stream)
From 10b0b9aa81523623f2d215cad8ea04711913479a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v31 1/8] Refactor some end-of-recovery code out of
 StartupXLOG().

Split the code that decides whether to write a checkpoint or an
end-of-recovery record, and the code that actually does it, into
DetermineRecoveryXlogAction(), which makes the decision, and
PerformRecoveryXLogAction(), which carries it out. Right now these
are always called one after the other, but further refactoring is
planned which will separate them.

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.
---
 src/backend/access/transam/xlog.c | 300 ++++++++++++++++++------------
 1 file changed, 179 insertions(+), 121 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b39a2fdaa5..6382b5cc932 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -506,6 +506,27 @@ typedef enum ExclusiveBackupState
 	EXCLUSIVE_BACKUP_STOPPING
 } ExclusiveBackupState;
 
+/*
+ * What should we do when we reach the end of REDO to ensure that we'll
+ * be able to recover properly if we crash again?
+ *
+ * RECOVERY_XLOG_NOTHING means we didn't actually REDO anything and therefore
+ * no action is required.
+ *
+ * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
+ * end-of-recovery record but don't need to checkpoint.
+ *
+ * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need to request that the
+ * checkpointer perform a checkpoint. This is only valid when the
+ * checkpointer is running.
+ */
+typedef enum
+{
+	RECOVERY_XLOG_NOTHING,
+	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_REQUEST_CHECKPOINT
+} RecoveryXlogAction;
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -880,6 +901,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -925,6 +948,8 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static RecoveryXlogAction DetermineRecoveryXlogAction(void);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5694,6 +5719,94 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * The archive recovery request can only be handled in the startup
+	 * process or a single backend process.
+	 */
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be
+	 * pre-allocated files containing garbage. In any case, they are not part
+	 * of the new timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because
+	 * it was hit by a meteor), it will never make it to the archive. That's
+	 * OK from our point of view, because the new segment that we created with
+	 * the new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically
+	 * present in the new file with new TLI, but recovery won't look there
+	 * when it's recovering to the older timeline. On the other hand, if we
+	 * archive the partial segment, and the original server on that timeline
+	 * is still running and archives the completed version of the same segment
+	 * later, it will fail. (We used to do that in 9.4 and below, and it
+	 * caused such problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix,
+	 * and archive it. Archive recovery will never try to read .partial
+	 * segments, so they will normally go unused. But in the odd PITR case,
+	 * the administrator can copy them manually to the pg_wal directory
+	 * (removing the suffix). They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let
+	 * it be archived normally. (In particular, if it was restored from the
+	 * archive to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -6503,7 +6616,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	RecoveryXlogAction xlogaction;
 	struct stat st;
 
 	/*
@@ -7874,127 +7987,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	xlogaction = DetermineRecoveryXlogAction();
+	PerformRecoveryXLogAction(xlogaction);
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8098,7 +8097,7 @@ StartupXLOG(void)
 	 * and in case of a crash, recovering from it might take a longer than is
 	 * appropriate now that we're not in standby mode anymore.
 	 */
-	if (promoted)
+	if (xlogaction == RECOVERY_XLOG_WRITE_END_OF_RECOVERY)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8198,6 +8197,65 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Determine what needs to be done upon completing REDO.
+ */
+static RecoveryXlogAction
+DetermineRecoveryXlogAction(void)
+{
+	/* No REDO, hence no action required. */
+	if (!InRecovery)
+		return RECOVERY_XLOG_NOTHING;
+
+	/*
+	 * In promotion, only create a lightweight end-of-recovery record instead
+	 * of a full checkpoint. A checkpoint is requested later, after we're
+	 * fully out of recovery mode and already accepting WAL writes.
+	 */
+	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		LocalPromoteIsTriggered)
+		return RECOVERY_XLOG_WRITE_END_OF_RECOVERY;
+
+	/*
+	 * We decided against writing only an end-of-recovery record, and we know
+	 * that the postmaster was told to launch the checkpointer, so just
+	 * request a checkpoint.
+	 */
+	return RECOVERY_XLOG_REQUEST_CHECKPOINT;
+}
+
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static void
+PerformRecoveryXLogAction(RecoveryXlogAction action)
+{
+	switch (action)
+	{
+		case RECOVERY_XLOG_NOTHING:
+			/* No REDO performed, hence nothing to do. */
+			break;
+
+		case RECOVERY_XLOG_WRITE_END_OF_RECOVERY:
+			/* Lightweight end-of-recovery record in lieu of checkpoint. */
+			CreateEndOfRecoveryRecord();
+			break;
+
+		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
+			/* Full checkpoint, when checkpointer is running. */
+			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+							  CHECKPOINT_IMMEDIATE |
+							  CHECKPOINT_WAIT);
+	}
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

#144 Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#143)
9 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is the version rebased onto the latest master head. I have
also added TAP tests covering part of this feature, and a separate
patch to test recovery_end_command execution.

I have also been through Prabhat's patch, which helped me write the
current tests, but I am not sure about a few basic tests that he
included in the TAP test which could otherwise be done with pg_regress,
e.g. checking the permission required to execute the pg_prohibit_wal()
function. I have yet to add those basic tests; is it OK to add them to
pg_regress instead of TAP? The downside I see is that the tests
covering a feature would then not all be in one place, which does not
seem right.

What is the usual practice? Is it acceptable to have a few tests in TAP
and a few in pg_regress for the same feature?

Regards,
Amul

On Wed, Aug 4, 2021 at 6:26 PM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version on top of the latest master head; it
includes the refactoring patches posted by Robert.

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:

I was quite worried about how I could have missed that, but after
thinking about it some more, I realized that the operation for
ArchiveRecoveryRequested is never going to be skipped in the startup
process and is never left for the checkpointer process to do later.
That is the reason the assert was added there.

When ArchiveRecoveryRequested is set, the server will no longer be in
WAL-prohibited mode; we implicitly change the state to WAL-permitted.
Here is the snippet from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.

Unfortunately, I didn't get much time to think about this and don't
have a strong opinion on it either.

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

Yes, my next plan is to work on the TAP tests and look into the patch
posted by Prabhat to improve test coverage.

Regards,
Amul Sul

Attachments:

v32-0004-Refactor-add-function-to-set-database-state-in-c.patch (application/x-patch)
From ec67b1ba8da378cde7a7ea73da017dfbdf48b8ca Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 16 Jun 2021 09:02:24 -0400
Subject: [PATCH v32 4/9] Refactor: add function to set database state in
 control file

====
TODO:
====
 - The same code updating the database state exists in StartupXLOG(),
   but it is not clear whether we need to optimize that, since that
   code updates SharedRecoveryState while holding ControlFileLock.
---
 src/backend/access/transam/xlog.c | 31 ++++++++++++++++---------------
 src/include/access/xlog.h         |  2 ++
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 15ace3230e0..1d0e9059420 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -38,7 +38,6 @@
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
-#include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
@@ -5155,6 +5154,19 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/*
+ * Set ControlFile's database state
+ */
+void
+SetControlFileDBState(DBState state)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = state;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -9231,13 +9243,7 @@ CreateCheckPoint(int flags)
 	START_CRIT_SECTION();
 
 	if (shutdown)
-	{
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
-		ControlFile->time = (pg_time_t) time(NULL);
-		UpdateControlFile();
-		LWLockRelease(ControlFileLock);
-	}
+		SetControlFileDBState(DB_SHUTDOWNING);
 
 	/*
 	 * Let smgr prepare for checkpoint; this has to happen before we determine
@@ -9786,13 +9792,8 @@ CreateRestartPoint(int flags)
 
 		UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
-		{
-			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-			ControlFile->time = (pg_time_t) time(NULL);
-			UpdateControlFile();
-			LWLockRelease(ControlFileLock);
-		}
+			SetControlFileDBState(DB_SHUTDOWNED_IN_RECOVERY);
+
 		return false;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6b6ae81c2d5..3f6e8a997cf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -15,6 +15,7 @@
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -300,6 +301,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileDBState(DBState state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
-- 
2.18.0

v32-0003-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 91b9a062b19d3c8fb10fdd8a2342c8d1a6b35774 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 15:37:53 -0400
Subject: [PATCH v32 3/9] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.
---
 src/backend/access/transam/xlog.c | 75 +++++++++++++++++++------------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e2214e98a76..15ace3230e0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -961,6 +961,9 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static void XLogAcceptWrites(RecoveryXlogAction xlogaction,
+							 TimeLineID EndOfLogTLI,
+							 XLogRecPtr EndOfLog);
 static RecoveryXlogAction DetermineRecoveryXlogAction(void);
 static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
@@ -8242,35 +8245,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8314,6 +8290,47 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static void
+XLogAcceptWrites(RecoveryXlogAction xlogaction,
+				 TimeLineID EndOfLogTLI,
+				 XLogRecPtr EndOfLog)
+{
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

v32-0002-Postpone-some-end-of-recovery-operations-relatin.patch (application/x-patch)
From 5ac4dc904b81d83ba4903e3c80820b9a58776e77 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH v32 2/9] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, we issued XLOG_FPW_CHANGE and either
XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY while still
technically in recovery, and also performed post-archive-recovery
cleanup steps at that point. Postpone that stuff until after we clear
InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.
---
 src/backend/access/transam/xlog.c | 34 ++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c36334b7e04..e2214e98a76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8185,22 +8185,11 @@ StartupXLOG(void)
 	}
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Figure out what xlog activity is needed to mark end of recovery. We
+	 * must make this determination before setting InRecovery = false, or
+	 * we'll get the wrong answer.
 	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	xlogaction = DetermineRecoveryXlogAction();
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8253,6 +8242,23 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

v32-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch (application/x-patch)
From 502dd553093bfdb415f81eaaecd8a36ee1e15751 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v32 1/9] Refactor some end-of-recovery code out of
 StartupXLOG().

Split the code that decides whether to write a checkpoint or an
end-of-recovery record, and the code that actually does it, into
DetermineRecoveryXlogAction(), which makes the decision, and
PerformRecoveryXLogAction(), which carries it out. Right now these
are always called one after the other, but further refactoring is
planned which will separate them.

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.
---
 src/backend/access/transam/xlog.c | 300 ++++++++++++++++++------------
 1 file changed, 179 insertions(+), 121 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03ec..c36334b7e04 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -506,6 +506,27 @@ typedef enum ExclusiveBackupState
 	EXCLUSIVE_BACKUP_STOPPING
 } ExclusiveBackupState;
 
+/*
+ * What should we do when we reach the end of REDO to ensure that we'll
+ * be able to recover properly if we crash again?
+ *
+ * RECOVERY_XLOG_NOTHING means we didn't actually REDO anything and therefore
+ * no action is required.
+ *
+ * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
+ * end-of-recovery record but don't need to checkpoint.
+ *
+ * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need to request that the
+ * checkpointer perform a checkpoint. This is only valid when the
+ * checkpointer is running.
+ */
+typedef enum
+{
+	RECOVERY_XLOG_NOTHING,
+	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_REQUEST_CHECKPOINT
+} RecoveryXlogAction;
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -892,6 +913,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -938,6 +961,8 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static RecoveryXlogAction DetermineRecoveryXlogAction(void);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5878,6 +5903,94 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * The archive recovery request can only be handled in the startup
+	 * process or a single backend process.
+	 */
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be
+	 * pre-allocated files containing garbage. In any case, they are not part
+	 * of the new timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because
+	 * it was hit by a meteor), it will never make it to the archive. That's
+	 * OK from our point of view, because the new segment that we created with
+	 * the new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically
+	 * present in the new file with new TLI, but recovery won't look there
+	 * when it's recovering to the older timeline. On the other hand, if we
+	 * archive the partial segment, and the original server on that timeline
+	 * is still running and archives the completed version of the same segment
+	 * later, it will fail. (We used to do that in 9.4 and below, and it
+	 * caused such problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix,
+	 * and archive it. Archive recovery will never try to read .partial
+	 * segments, so they will normally go unused. But in the odd PITR case,
+	 * the administrator can copy them manually to the pg_wal directory
+	 * (removing the suffix). They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let
+	 * it be archived normally. (In particular, if it was restored from the
+	 * archive to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname, true);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -6696,7 +6809,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	RecoveryXlogAction xlogaction;
 	struct stat st;
 
 	/*
@@ -8081,127 +8194,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	xlogaction = DetermineRecoveryXlogAction();
+	PerformRecoveryXLogAction(xlogaction);
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname, true);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8305,7 +8304,7 @@ StartupXLOG(void)
 	 * and in case of a crash, recovering from it might take a longer than is
 	 * appropriate now that we're not in standby mode anymore.
 	 */
-	if (promoted)
+	if (xlogaction == RECOVERY_XLOG_WRITE_END_OF_RECOVERY)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8405,6 +8404,65 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Determine what needs to be done upon completing REDO.
+ */
+static RecoveryXlogAction
+DetermineRecoveryXlogAction(void)
+{
+	/* No REDO, hence no action required. */
+	if (!InRecovery)
+		return RECOVERY_XLOG_NOTHING;
+
+	/*
+	 * In promotion, only create a lightweight end-of-recovery record instead
+	 * of a full checkpoint. A checkpoint is requested later, after we're
+	 * fully out of recovery mode and already accepting WAL writes.
+	 */
+	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		LocalPromoteIsTriggered)
+		return RECOVERY_XLOG_WRITE_END_OF_RECOVERY;
+
+	/*
+	 * We decided against writing only an end-of-recovery record, and we know
+	 * that the postmaster was told to launch the checkpointer, so just
+	 * request a checkpoint.
+	 */
+	return RECOVERY_XLOG_REQUEST_CHECKPOINT;
+}
+
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static void
+PerformRecoveryXLogAction(RecoveryXlogAction action)
+{
+	switch (action)
+	{
+		case RECOVERY_XLOG_NOTHING:
+			/* No REDO performed, hence nothing to do. */
+			break;
+
+		case RECOVERY_XLOG_WRITE_END_OF_RECOVERY:
+			/* Lightweight end-of-recovery record in lieu of checkpoint. */
+			CreateEndOfRecoveryRecord();
+			break;
+
+		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
+			/* Full checkpoint, when checkpointer is running. */
+			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+							  CHECKPOINT_IMMEDIATE |
+							  CHECKPOINT_WAIT);
+	}
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

Attachment: v32-0009-Test-check-recovery_end_command-execution.patch (application/x-patch)
From 653f1f90ebb72270c852cfea11c9a7af008e8886 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 31 Aug 2021 07:38:38 -0400
Subject: [PATCH v32 9/9] Test: check recovery_end_command execution.

---
 src/test/recovery/t/002_archiving.pl | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index ce60159f036..905fff33cb7 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 use File::Copy;
 
 # Initialize primary node, doing archives
@@ -65,6 +65,13 @@ $node_standby->promote;
 my $node_standby2 = PostgresNode->new('standby2');
 $node_standby2->init_from_backup($node_primary, $backup_name,
 	has_restoring => 1);
+
+# Also, test recovery_end_command by creating an empty file.
+my $tmp_path = TestLib::perl2host(TestLib::tempdir_short());
+my $recovery_end_command_file = "$tmp_path/recovery_end_command.done";
+$node_standby2->append_conf('postgresql.conf',
+	"recovery_end_command='touch $recovery_end_command_file'");
+
 $node_standby2->start;
 
 # Now promote standby2, and check that temporary files specifically
@@ -75,3 +82,7 @@ ok( !-f "$node_standby2_data/pg_wal/RECOVERYHISTORY",
 	"RECOVERYHISTORY removed after promotion");
 ok( !-f "$node_standby2_data/pg_wal/RECOVERYXLOG",
 	"RECOVERYXLOG removed after promotion");
+
+# Also, check recovery_end_command execution.
+ok(-f "$recovery_end_command_file",
+	'recovery_end_command executed after promotion');
-- 
2.18.0

Attachment: v32-0007-Documentation.patch (application/x-patch)
From 18cd1b5316509e31355b5acc883f78da2683394f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v32 7/9] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbeb..0ea08ee91ac 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25224,9 +25224,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25343,6 +25343,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ()
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system is put into the WAL
+        prohibited state, where WAL writes are not allowed, if it is not
+        already. When <literal>false</literal> is passed, the system is put
+        into the WAL permitted state, where WAL writes are allowed, if it is
+        not already. See <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 22af7dbf51b..89da3eb2e94 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    WAL prohibited mode, in which inserting write-ahead log is prohibited
+    until the same function is executed to change the state back to
+    read-write. As in Hot Standby, connections to the server are still
+    allowed to run read-only queries while in the WAL prohibited state. If
+    the system is in the WAL prohibited state, the GUC
+    <literal>wal_prohibited</literal> reports <literal>on</literal>;
+    otherwise it reports <literal>off</literal>.  When the WAL prohibited
+    state is requested, any session that is running a transaction which has
+    already performed, or may still perform, WAL write operations is
+    terminated. This is useful for HA setups where the master server needs to
+    stop accepting WAL writes immediately and kick out any transaction
+    expecting to write WAL at commit, for example when the master loses
+    network connectivity or its replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint; at
+    restart the server goes through crash recovery and stays in that state
+    until the system is changed back to read-write.  If a WAL prohibited
+    server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system is
+    implicitly taken out of the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by pg_prohibit_wal().
+We have a lower-level defense in XLogBeginInsert() and elsewhere to stop us
+from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, because that error would escalate to PANIC, as noted above.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend that receives the WAL prohibited state transition barrier must stop
+writing WAL immediately.  To absorb the barrier, a backend kills its running
+transaction if it has a valid XID, since a valid XID indicates that the
+transaction has performed, or is planning, a WAL write.  Transactions that have
+not yet acquired an XID, and operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that write WAL without necessarily having an XID, are not stopped
+during barrier processing; they might instead hit an error from
+XLogBeginInsert() when trying to write WAL in the WAL prohibited state.  To
+prevent such an error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag indicating that
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, one of
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when the critical section is exited.  The rules for choosing among these
+permission check routines are:
+
+	Places where a WAL write inside a critical section can happen without a
+	valid XID (e.g. VACUUM) must be protected by CheckWALPermitted(), so that
+	the error can be reported before entering the critical section.
+
+	Places where INSERTs and UPDATEs are expected, which never happen without
+	a valid XID, can be checked with AssertWALPermittedHaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and
+	that may or may not have an XID, but where we still want assert-enabled
+	builds to verify that permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it inside a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read-only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set in those states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
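
The coding rule described in the transam/README hunk above can be illustrated
with a minimal sketch.  This is not part of the attached patches: it assumes
the CheckWALPermitted()/AssertWALPermittedHaveXID() routines introduced by this
series, and the record payload and the RM_XLOG_ID/XLOG_NOOP rmgr/info values
are placeholders chosen only to keep the example short.

	/*
	 * Sketch of a WAL-writing code path under the new rule: check WAL
	 * permission *before* the critical section, so a WAL-prohibited error
	 * is raised as a plain ERROR here rather than inside the critical
	 * section, where it would escalate to PANIC.  Use
	 * AssertWALPermittedHaveXID() instead when the caller is known to hold
	 * a valid XID, or AssertWALPermitted() when the path cannot be reached
	 * in the WAL prohibited state.
	 *
	 * Assumed headers: access/walprohibit.h, access/xloginsert.h,
	 * access/xlog.h, catalog/pg_control.h (for XLOG_NOOP), miscadmin.h.
	 */
	static void
	sketch_log_something(char *payload, int len)
	{
		XLogRecPtr	recptr;

		/* Fail cleanly (ERROR) if WAL writes are currently prohibited. */
		CheckWALPermitted();

		START_CRIT_SECTION();

		XLogBeginInsert();
		XLogRegisterData(payload, len);
		recptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP); /* placeholder record */

		END_CRIT_SECTION();

		XLogFlush(recptr);
	}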

Attachment: v32-0008-Test-Few-tap-tests-for-wal-prohibited-system.patch (application/x-patch)
From daa8126c90e7821f4f5de08926fb25d4a2cd98d3 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v32 8/9] Test: Few tap tests for wal prohibited system

Performs the following tests:

1. Verify that after a restart of a WAL prohibited system, the WAL write and
   checkpoint LSNs do not change and the WAL prohibited state is preserved.
2. A standby server cannot be WAL prohibited; standby.signal or
   recovery.signal takes the system out of the WAL prohibited state.
3. On a WAL prohibited system the shutdown checkpoint and, at the next start,
   the end-of-recovery checkpoint are skipped; verify that an implicit
   checkpoint is performed when the state changes to WAL permitted.
---
 src/test/recovery/t/026_pg_prohibit_wal.pl | 134 +++++++++++++++++++++
 1 file changed, 134 insertions(+)
 create mode 100644 src/test/recovery/t/026_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/026_pg_prohibit_wal.pl b/src/test/recovery/t/026_pg_prohibit_wal.pl
new file mode 100644
index 00000000000..6abd04935fc
--- /dev/null
+++ b/src/test/recovery/t/026_pg_prohibit_wal.pl
@@ -0,0 +1,134 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test for pg_prohibit_wal().
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize primary node
+my $node_primary = PostgresNode->new('primary');
+my $port = $node_primary->port();
+$node_primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$node_primary->start;
+
+# Create table and insert some data
+$node_primary->safe_psql('postgres', 'CREATE TABLE tab(i int)');
+$node_primary->safe_psql('postgres',
+	'INSERT INTO tab VALUES(generate_series(1,5))');
+
+# Change primary to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'on', 'primary server is now wal prohibited');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'on', 'primary server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on primary, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint on primary, last checkpoint lsn : $checkpoint_lsn");
+
+# Now stop the primary server in the WAL prohibited state, take a filesystem
+# level backup, and set up a new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgresNode->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server was stopped in the WAL prohibited state, so the
+# filesystem level copy is also WAL prohibited.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set the new server up as a standby of the primary.
+# enable_streaming creates a standby.signal file, which takes the system out
+# of the WAL prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check that the new server has been taken out of the WAL prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# Standby server cannot be put into wal prohibited state.
+my ($stdout, $stderr, $timed_out);
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# The primary is still WAL prohibited, so a further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'primary server is wal prohibited, table insert fails');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'primary server is changed to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+isnt($new_checkpoint_lsn, $checkpoint_lsn,
+	"new implicit checkpoint performed on primary, new checkpoint lsn : $new_checkpoint_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '6',
+	'insert passed on primary');
+
+# Wait until necessary replay has been done on standby
+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+my $caughtup_query =
+  "SELECT '$current_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for standby to catch up";
+
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '6',
+	'new insert replicated on standby as well');
+
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location found in control file\n"
+	  unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0
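
The behaviour exercised by the TAP test above can also be reproduced by hand.
Below is a minimal libpq sketch, not part of the attached patches, that toggles
the WAL prohibit state and reads it back the same way the test does.  It
assumes the pg_prohibit_wal() function and the wal_prohibited GUC added by this
series, a role that is permitted to execute pg_prohibit_wal(), and a
placeholder connection string.

	#include <stdio.h>
	#include "libpq-fe.h"

	int
	main(void)
	{
		/* Placeholder connection string; adjust for your setup. */
		PGconn	   *conn = PQconnectdb("dbname=postgres");
		PGresult   *res;

		if (PQstatus(conn) != CONNECTION_OK)
		{
			fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
			return 1;
		}

		/* Make the server WAL prohibited (read-only). */
		res = PQexec(conn, "SELECT pg_prohibit_wal(true)");
		PQclear(res);

		/* Read the state back, as the TAP test does. */
		res = PQexec(conn, "SELECT current_setting('wal_prohibited')");
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("wal_prohibited = %s\n", PQgetvalue(res, 0, 0));
		PQclear(res);

		/* And allow WAL writes again. */
		res = PQexec(conn, "SELECT pg_prohibit_wal(false)");
		PQclear(res);

		PQfinish(conn);
		return 0;
	}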

Attachment: v32-0005-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From a1d406c3689ae34b6c3665ac79f5a6a5ffba7236 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v32 5/9] Implement wal prohibit state using global barrier.

Implementation:

 1. A user changes the server state to WAL-prohibited by calling the
    pg_prohibit_wal(true) SQL function.  The current state generation in
    shared memory is marked as in progress and the checkpointer process is
    signaled.  The checkpointer, noticing the in-progress state
    transition, emits the barrier request, and then acknowledges back
    to the backend that requested the state change once the transition
    has been completed.  The final state is updated in the control file
    to make it persistent across system restarts.

 2. When a backend receives the WAL-prohibited barrier, if it is already in
    a transaction and the transaction has already been assigned an XID,
    then the backend is killed by throwing FATAL (XXX: need more discussion
    on this).

 3. Otherwise, if that backend is running a transaction without a valid XID,
    we don't need to do anything special right now; simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
    XLogInsertAllowed() first, which takes the WAL prohibited state into account.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-prohibited server state until someone wakes them up.  E.g. a
    backend might later on request us to put the system back into the state
    where WAL is no longer prohibited.

 6. At shutdown in WAL-prohibited mode, we'll skip the shutdown checkpoint
    and xlog rotation.  Starting up again will perform crash recovery, but
    the end-of-recovery checkpoint and the WAL writes necessary to start the
    server normally will be skipped; they are performed when the system is
    changed back to a state where WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is restricted on a standby server.

 8. The presence of a standby.signal and/or recovery.signal file will
    implicitly pull the server out of the WAL prohibited state permanently.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on" when
    the system is WAL prohibited.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 477 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 201 +++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  59 +++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 877 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..d7f8ffaa09c
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,477 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; its last two bits indicate the
+	 * current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take appropriate action when the system-wide WAL
+ *	prohibit state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that still need more thought:
+		 *
+		 * 1. Due to some challenges with the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * It is not a final state, since we have yet to convey this WAL prohibit
+	 * state to all backends.  The checkpointer will do that and update the
+	 * shared memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Other than read-write state will be considered as read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete WAL prohibit state transition.
+ *
+ *	Depending on the final WAL prohibit state being transitioned to, the
+ *	in-memory state update is done before or after emitting the global barrier.
+ *
+ *	The idea behind this is that when we say the system is WAL prohibited,
+ *	WAL writes in all backends must be prohibited, but when the system is no
+ *	longer WAL prohibited, it is not necessary to take every backend out of
+ *	the WAL prohibited state at once.  There is no harm in letting those
+ *	backends run as read-only for a little longer, until we emit the barrier,
+ *	since they might have connected while the system was WAL prohibited and
+ *	might be doing read-only work.  Backends that connect from now on can
+ *	immediately start read-write operations.
+ *
+ *	Therefore, when moving the system to the state where WAL is no longer
+ *	prohibited, we update the shared state immediately and emit the barrier
+ *	later.  But when moving the system to the WAL prohibited state, we emit
+ *	the global barrier first, to ensure that no backend writes WAL before we
+ *	set the system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it must be
+	 * completed.  If the server crashes before it completes, the control file
+	 * information will be used to set the final WAL prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* Going out of WAL prohibited state then update state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operations needed to allow WAL writes should be done by now */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by all backends, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the
+		 * shared WAL prohibit state counter should not have been changed
+		 * by now.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  Checkpointer has to be
+	 * sure it has processed all pending wal prohibit state change requests as
+	 * soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if not
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process would normally perform
+				 * to start the server were skipped; if that is the case, do
+				 * them right away.  While doing that, hold off state
+				 * transitions to avoid a recursive attempt to process a WAL
+				 * prohibit state transition from the end-of-recovery
+				 * checkpoint.
+				 */
+				if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+				{
+					HoldWALProhibitStateTransition = true;
+					PerformPendingXLogAcceptWrites();
+					HoldWALProhibitStateTransition = false;
+				}
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request us to put the system back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 387f80419a5..6141a3d0425 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1d0e9059420..e69a213aa98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -231,9 +232,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -515,6 +517,9 @@ typedef enum ExclusiveBackupState
  * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
  * end-of-recovery record but don't need to checkpoint.
  *
+ * RECOVERY_XLOG_WRITE_CHECKPOINT means we need to write a checkpoint.
+ * This is only valid when the checkpointer itself wants a checkpoint.
+ *
  * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need a request that the
  * checkpointer perform a checkpoint. This is only valid when the
  * checkpointer is running.
@@ -523,6 +528,7 @@ typedef enum
 {
 	RECOVERY_XLOG_NOTHING,
 	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_WRITE_CHECKPOINT,
 	RECOVERY_XLOG_REQUEST_CHECKPOINT
 } RecoveryXlogAction;
 
@@ -743,6 +749,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the WAL writes required
+	 * to start the server normally have been performed yet.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 
 	/*
@@ -923,6 +935,7 @@ static bool recoveryApplyDelay(XLogReaderState *record);
 static void SetLatestXTime(TimestampTz xtime);
 static void SetCurrentChunkStartTime(TimestampTz xtime);
 static void CheckRequiredParameterValues(void);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static void XLogReportParameters(void);
 static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 								TimeLineID prevTLI);
@@ -5167,6 +5180,17 @@ SetControlFileDBState(DBState state)
 	LWLockRelease(ControlFileLock);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5443,6 +5467,7 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6609,6 +6634,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch latest state of allow WAL writes.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6974,13 +7008,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, the system must not remain in
+		 * the WAL prohibited state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8257,8 +8308,29 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Prepare to accept WAL writes. */
-	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed
+	 * or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8310,13 +8382,32 @@ XLogAcceptWrites(RecoveryXlogAction xlogaction,
 				 TimeLineID EndOfLogTLI,
 				 XLogRecPtr EndOfLog)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	XLogCtlInsert *Insert;
+
+	/* Only Startup or checkpointer or standalone backend allowed to be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsUnderPostmaster);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, only the checkpointer
+	 * process should be here, completing the operation that was skipped
+	 * earlier while booting the system in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
+	Insert = &XLogCtl->Insert;
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
@@ -8341,6 +8432,40 @@ XLogAcceptWrites(RecoveryXlogAction xlogaction,
 	 * commit timestamp.
 	 */
 	CompleteCommitTsInitialization();
+
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+}
+
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * The EndOfLogTLI and EndOfLog inputs to XLogAcceptWrites() are only
+	 * needed when archive recovery was requested, and we never get here in
+	 * that case: if archive recovery is requested the system is taken out of
+	 * the WAL prohibited state and the XLogAcceptWrites() operation is never
+	 * skipped.
+	 */
+	XLogAcceptWrites(DetermineRecoveryXlogAction(), 0, InvalidXLogRecPtr);
+
+	/*
+	 * We need to update DBState explicitly, as the startup process does,
+	 * because the end-of-recovery checkpoint would set the database state
+	 * to shutdown.
+	 */
+	SetControlFileDBState(DB_IN_PRODUCTION);
 }
 
 /*
@@ -8445,6 +8570,12 @@ CheckRecoveryConsistency(void)
 static RecoveryXlogAction
 DetermineRecoveryXlogAction(void)
 {
+	/*
+	 * Write a checkpoint straight away if this is the checkpointer process.
+	 */
+	if (AmCheckpointerProcess())
+		return RECOVERY_XLOG_WRITE_CHECKPOINT;
+
 	/* No REDO, hence no action required. */
 	if (!InRecovery)
 		return RECOVERY_XLOG_NOTHING;
@@ -8490,6 +8621,11 @@ PerformRecoveryXLogAction(RecoveryXlogAction action)
 			CreateEndOfRecoveryRecord();
 			break;
 
+		case RECOVERY_XLOG_WRITE_CHECKPOINT:
+			/* Full checkpoint, when the checkpointer itself is calling this. */
+			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+			break;
+
 		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
 			/* Full checkpoint, when checkpointer is running. */
 			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
@@ -8616,9 +8752,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8637,9 +8773,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8661,6 +8808,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8950,9 +9103,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * During recovery a restartpoint is performed; otherwise, the shutdown
+	 * checkpoint and xlog rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8965,6 +9122,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because WAL writes are prohibited")));
 }
 
 /*
@@ -9214,8 +9374,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a416e94d371..0934478188e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -699,6 +699,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 3b3df8fa8cc..9ea299adbc1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..d3e3e156686 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -348,6 +349,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -692,6 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1341,3 +1346,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97e..fd3ffc80557 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -223,6 +224,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 364654e1060..c5d8edd82bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 4a2ed414b00..06f8c9569f0 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to delay WAL prohibit change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon
+		 * as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold off WAL prohibit state change requests for a
+		 * long time when there are many fsync requests to be processed.  They
+		 * need to be checked and processed by the checkpointer promptly.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check for a WAL prohibit state change request here too, for
+				 * the same reasons as above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 27fbf1f3aae..083555aedfe 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ef7e6bfb779..85a12fc580a 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 467b0fd6fe7..784197b7ded 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -234,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -674,6 +676,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2116,6 +2119,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether WAL writes are prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12536,4 +12551,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ff77a68552c
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,59 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
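
(Editor's aside, not part of the patch: the counter arithmetic described in the
header comment above can be seen in a tiny standalone program.  The shared
counter only ever increases, and its low two bits select one of the four
states, so the cycle repeats every four increments.)

#include <stdio.h>

/* Mirrors the WALProhibitState encoding described in walprohibit.h above. */
static const char *const state_names[] = {
	"READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
};

int
main(void)
{
	/* The shared counter is only ever incremented; (counter & 3) is the state. */
	for (unsigned int counter = 0; counter < 8; counter++)
		printf("counter %u -> WALPROHIBIT_STATE_%s\n",
			   counter, state_names[counter & 3]);
	return 0;
}
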
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 3f6e8a997cf..96a3e02992e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -134,6 +134,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of work that enables wal writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -282,6 +290,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -290,6 +299,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -301,7 +311,9 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void SetControlFileDBState(DBState state);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL writes are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b603700ed9d..64c9020a77b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11645,6 +11645,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6007827b445..43e826ceeb3 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -225,7 +225,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 37cf4b2f76b..107fccc6e8a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2824,6 +2824,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

Attachment: v32-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 0334aa0324376ae5ea92913c9b72f18d45ede045 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v32 6/9] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before entering a critical section that writes WAL
while the system is WAL-prohibited, based on the following criteria:

 - Add an ERROR for functions that can be reached without a valid XID, e.g.
   by VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Add an Assert for functions that cannot be reached without a valid XID;
   the Assert also verifies XID validity.  For that,
   AssertWALPermittedHaveXID() is added.

To enforce the rule that one of these checks precedes any critical section
that writes WAL, a new assert-only flag, walpermit_checked_state, is added.
If the check is missing, XLogBeginInsert() will fail an assertion when it is
called inside a critical section.

If the WAL insert is not done inside a critical section, the explicit check
is unnecessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
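
As a condensed illustration of the new rule (editor's sketch, not a hunk of
this patch; it is modeled on the heap_surgery.c change below):

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
example_log_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * This path may be reached without a valid XID, so use CheckWALPermitted()
	 * to raise an ERROR before entering the critical section; paths that
	 * always hold a valid XID would use AssertWALPermittedHaveXID() instead.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	MarkBufferDirty(buf);

	/* XLogBeginInsert(), called underneath, asserts that the check was made. */
	if (needwal)
		log_newpage_buffer(buf, true);

	END_CRIT_SECTION();
}
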
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 +++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 +++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 14 ++++++-
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 47 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 +++++++++++++
 39 files changed, 501 insertions(+), 67 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..8c672770e79 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 06c05865435..2b4812fb23a 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cdd626ff0a4..0940b20c718 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a92..52b2b36e4a9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index d254a00b6ac..20ff6d051bf 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 404f2b62212..3b920c76936 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index b7300253566..8ddae82b57d 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2433998f39b..2945ea4b6ba 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..0cb9adf8b5d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write WAL if it is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 334d8a2aa71..191af400719 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1347,6 +1348,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1362,8 +1368,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1970,8 +1975,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1996,7 +2006,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2428,6 +2438,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2438,6 +2449,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2468,7 +2482,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..b532b522275 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in WAL prohibit state, so skip
+	 * the permission check if we reach here in the startup process.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +486,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +500,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +534,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 271994b08df..99466b5a5a9 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c3..1519e4d233d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2212,6 +2215,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2310,6 +2316,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index d7f8ffaa09c..9c77175090f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section that will write WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6141a3d0425..72c14cc3e9f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can reach here only with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e69a213aa98..309ed6c6d9b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1069,7 +1069,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2958,9 +2958,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9626,6 +9628,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9791,6 +9796,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10452,7 +10459,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10466,10 +10473,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10491,8 +10498,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Make sure WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index e596a0470a9..6a5d6561439 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -138,9 +139,15 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL prohibited error would escalate to a PANIC
+	 * inside the critical section.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
 	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -222,6 +229,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d3e3e156686..b81187bf880 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -928,6 +928,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3b485de067f..a9a32e5e8bc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3879,13 +3879,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL writes are disallowed in general, don't
+			 * dirty the page.  We can set the hint, but must not dirty the
+			 * page as a result, lest we trigger WAL generation.  Unless the
+			 * page is dirtied again later, the hint will be lost when the page
+			 * is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ff77a68552c..a4245aabe5c 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -56,4 +57,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off when pg_prohibit_wal() is executed,
+ * so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * Unlike the assertion above, a transaction that doesn't have a valid XID
+ * (e.g. VACUUM) won't be killed while the system state is being changed to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e2e9a364a7..6d137bb007b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset the walprohibit_checked_state flag when no longer in a critical
+ * section.  Otherwise, mark it checked-and-used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

#145Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#144)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Aug 31, 2021 at 8:16 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head. Also,
added tap tests to test some part of this feature and a separate patch
to test recovery_end_command execution.

It looks like you haven't given any thought to writing that in a way
that will work on Windows?

What is usual practice, can have a few tests in TAP and a few in
pg_regress for the same feature?

Sure, there's no problem with that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#146Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#144)
Re: [Patch] ALTER SYSTEM READ ONLY

On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head.

Hi Amul!

Could you please rebase again?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#147Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#146)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger <mark.dilger@enterprisedb.com>
wrote:

On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head.

Hi Amul!

Could you please rebase again?

Ok will do that tomorrow, thanks.

Regards,
Amul

#148Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#147)
8 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Sep 7, 2021 at 10:02 PM Amul Sul <sulamul@gmail.com> wrote:

On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head.

Hi Amul!

Could you please rebase again?

Ok will do that tomorrow, thanks.

Here is the rebased version. I have added a few more test cases; it
probably needs more tests and some optimization, which I'll try in
the next version. I dropped the patch for recovery_end_command
testing and will post that separately.

Regards,
Amul

Attachments:

v33-0008-Test-Few-tap-tests-for-wal-prohibited-system.patchapplication/x-patch; name=v33-0008-Test-Few-tap-tests-for-wal-prohibited-system.patchDownload
From 4d660610d36b314e8e3106beb2c93b072ee505a2 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v33 8/8] Test: Few tap tests for wal prohibited system

Does the following testing (a minimal client-side sketch follows the list):

1. Basic verification, such as inserting into normal and unlogged tables
   on a wal prohibited system.
2. Check that a non-superuser needs to be granted permission to alter the
   wal prohibited system state.
3. Verify that sessions with open write transactions are disconnected when
   the system state has been changed to wal prohibited.
4. Verify that the wal write and latest checkpoint lsn do not change across
   a restart of a wal prohibited system, and that the wal prohibited state
   is preserved.
5. At restart of a wal prohibited system the shutdown and end-of-recovery
   checkpoints are skipped; verify that an implicit checkpoint is performed
   when the system state changes to wal permitted.
6. A standby server cannot be wal prohibited; standby.signal and/or
   recovery.signal take the system out of the wal prohibited state.
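
For illustration only -- this is not part of the patch set -- the state
toggle and the read-only error that these tests verify could also be
exercised from a small libpq client. The connection string and the "tab"
table below are assumptions matching the test setup; pg_prohibit_wal() is
the SQL function added earlier in the series:

/* walprohibit_demo.c -- hypothetical sketch, not part of the patches */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* Assumes a local server with a "postgres" database and a "tab" table */
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Put the server into wal prohibited state */
	PQclear(PQexec(conn, "SELECT pg_prohibit_wal(true)"));

	/* An insert is now rejected as in a read-only transaction */
	res = PQexec(conn, "INSERT INTO tab VALUES (42)");
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		printf("insert rejected: %s", PQerrorMessage(conn));
	PQclear(res);

	/* Put the server back into wal permitted state */
	PQclear(PQexec(conn, "SELECT pg_prohibit_wal(false)"));

	PQfinish(conn);
	return 0;
}

Built against libpq in the usual way (pg_config --includedir / --libdir,
-lpq); the TAP test below exercises the same behavior through psql.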
---
 src/test/recovery/t/026_pg_prohibit_wal.pl | 213 +++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 src/test/recovery/t/026_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/026_pg_prohibit_wal.pl b/src/test/recovery/t/026_pg_prohibit_wal.pl
new file mode 100644
index 00000000000..4974059aa9a
--- /dev/null
+++ b/src/test/recovery/t/026_pg_prohibit_wal.pl
@@ -0,0 +1,213 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test wal prohibited state.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Config;
+use Test::More tests => 22;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize database node
+my $node_primary = PostgresNode->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Create a few tables and insert some data
+$node_primary->safe_psql('postgres',  <<EOSQL);
+CREATE TABLE tab AS SELECT i FROM generate_series(1,5) i;
+CREATE UNLOGGED TABLE unlogtab AS SELECT i FROM generate_series(1,5) i;
+EOSQL
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is now wal prohibited');
+
+#
+# In wal prohibited state, further table inserts will fail.
+#
+# Note that even though an insert into an unlogged or temporary table doesn't
+# itself generate WAL, the transaction performing that insert will acquire a
+# transaction id, which is not allowed on a wal prohibited system. Also, the
+# transaction's abort or commit record will be WAL-logged at the end, which is
+# prohibited as well.
+#
+my ($stdout, $stderr, $timed_out);
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, table insert failed');
+$node_primary->psql('postgres', 'INSERT INTO unlogtab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, unlogged table insert failed');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on server, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint, last checkpoint lsn : $checkpoint_lsn");
+
+# Change server to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'server is changed to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+is($new_checkpoint_lsn ne $checkpoint_lsn, 1,
+	"new checkpoint performed, new checkpoint lsn : $new_checkpoint_lsn");
+
+my $new_write_lsn = $node_primary->lsn('write');
+is($new_write_lsn ne $write_lsn, 1,
+	"new wal writes on server, new latest wal write lsn : $new_write_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '6',
+	'table insert passed');
+
+# Only a superuser, or a user who has been granted permission, is able to
+# call pg_prohibit_wal to change the wal prohibited state.
+$node_primary->safe_psql('postgres', 'CREATE USER non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+like($stderr, qr/permission denied for function pg_prohibit_wal/,
+	'permission denied to non-superuser to alter wal prohibited state');
+$node_primary->safe_psql('postgres', 'GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'granted permission to non-superuser, able to alter wal prohibited state');
+
+# back to normal state
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(false)');
+
+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$mysession_stdin,
+	'>',
+	\$mysession_stdout,
+	'2>',
+	\$mysession_stderr,
+	$psql_timeout);
+
+# Start a write transaction in this session
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+	'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+	'session with open write transaction is terminated');
+
+# Now stop the primary server in WAL prohibited state, take a filesystem-level
+# backup, and set up a new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgresNode->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server was stopped in wal prohibited state, so the filesystem
+# level copy is also in wal prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set the new server up as a standby of the primary.
+# enable_streaming will create the standby.signal file, which will take the
+# system out of wal prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check that the new server has been taken out of the wal prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# A server in recovery cannot be put into wal prohibited state.
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# Primary is still in wal prohibited state, so a further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'primary server is wal prohibited, table insert failed');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'primary server is changed to wal permitted');
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '7',
+	'insert passed on primary');
+
+# Wait for standbys to catch up
+$node_primary->wait_for_catchup($node_standby, 'write');
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '7',
+	'new insert replicated on standby as well');
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location in control file found\n"
+	unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0

v33-0004-Refactor-add-function-to-set-database-state-in-c.patchapplication/x-patch; name=v33-0004-Refactor-add-function-to-set-database-state-in-c.patchDownload
From ecedbdf7b038b453903299a8cd8b2759e06aab56 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 16 Jun 2021 09:02:24 -0400
Subject: [PATCH v33 4/8] Refactor: add function to set database state in
 control file

====
TODO:
====
 - The same code updating the database state exists in StartupXLOG(), but
   it is not clear whether we need to refactor that too, since that code also
   updates SharedRecoveryState while holding ControlFileLock.
---
 src/backend/access/transam/xlog.c | 31 ++++++++++++++++---------------
 src/include/access/xlog.h         |  2 ++
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 453eb2700b5..1ec236352d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -38,7 +38,6 @@
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
-#include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
@@ -4979,6 +4978,19 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/*
+ * Set ControlFile's database state
+ */
+void
+SetControlFileDBState(DBState state)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = state;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -9036,13 +9048,7 @@ CreateCheckPoint(int flags)
 	START_CRIT_SECTION();
 
 	if (shutdown)
-	{
-		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
-		ControlFile->time = (pg_time_t) time(NULL);
-		UpdateControlFile();
-		LWLockRelease(ControlFileLock);
-	}
+		SetControlFileDBState(DB_SHUTDOWNING);
 
 	/*
 	 * Let smgr prepare for checkpoint; this has to happen before we determine
@@ -9591,13 +9597,8 @@ CreateRestartPoint(int flags)
 
 		UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
-		{
-			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-			ControlFile->time = (pg_time_t) time(NULL);
-			UpdateControlFile();
-			LWLockRelease(ControlFileLock);
-		}
+			SetControlFileDBState(DB_SHUTDOWNED_IN_RECOVERY);
+
 		return false;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a8ede700de..4f8b3e31ab7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -15,6 +15,7 @@
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -300,6 +301,7 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void SetControlFileDBState(DBState state);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
-- 
2.18.0

v33-0005-Implement-wal-prohibit-state-using-global-barrie.patchapplication/x-patch; name=v33-0005-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From ec6bc95e815ab41f0d075fc20a116af15a00f7cb Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v33 5/8] Implement wal prohibit state using global barrier.

Implementation:

 1. A user changes the server state to WAL-Prohibited by calling the
    pg_prohibit_wal(true) SQL function.  The requested state transition is
    marked as in progress in shared memory and the checkpointer process is
    signaled.  The checkpointer, noticing the state transition, emits the
    barrier request and, once the transition has been completed, acknowledges
    back to the backend that requested the state change.  The final state is
    updated in the control file to make it persistent across system restarts.

 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
    transaction that has been assigned an XID, the backend is killed by
    throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special needs to be done right away; we simply call
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which reflects the WAL prohibited state
    appropriately (see the sketch after this list).

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-Prohibited server state until someone wakes them up, e.g. a
    backend that later requests putting the system back into a state where
    WAL is no longer prohibited.

 6. At shutdown in WAL-Prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery, but
    the end-of-recovery checkpoint and the WAL writes necessary to start the
    server normally are skipped; they are performed once the system is
    changed back to a state where WAL is no longer prohibited.

 7. Altering the WAL-Prohibited mode is not allowed on a standby server.

 8. The presence of a standby.signal and/or recovery.signal file will
    implicitly and permanently take the server out of the WAL prohibited
    state.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on"
    when the system is WAL prohibited.
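
As a rough sketch of point 3 above -- for illustration only, not part of
this patch -- a caller that may write WAL is expected to check permission
before entering its critical section. The helper name and the
relation/buffer handling below are made up; CheckWALPermitted(),
RelationNeedsWAL(), log_newpage_buffer(), and the critical-section macros
are the existing interfaces this series touches elsewhere:

/* Hypothetical caller, for illustration only */
#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
example_log_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * Check WAL permission *before* entering the critical section, so that
	 * a wal prohibited state raises a plain ERROR here instead of escalating
	 * to a PANIC inside the critical section.
	 */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	MarkBufferDirty(buf);

	/* The WAL write itself still happens inside the critical section */
	if (needwal)
		log_newpage_buffer(buf, false);

	END_CRIT_SECTION();
}

When an XID is already known to be assigned, the assert-only
AssertWALPermittedHaveXID() is used instead, since XID-bearing sessions
have been killed off before the barrier is absorbed.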
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 477 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 201 +++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   9 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  20 +
 src/backend/storage/ipc/ipci.c           |   6 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  30 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   6 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  59 +++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 877 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..d7f8ffaa09c
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,477 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current wal prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take an appropriate action when system wide WAL prohibit
+ *	state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL
+		 * prohibited state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction, then the ERROR will kill
+		 * the current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again later.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_VOID();	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again later.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_VOID();
+	}
+
+	/*
+	 * This is not the final state yet, since we have yet to convey this WAL
+	 * prohibit state to all backends.  The checkpointer will do that and then
+	 * update the shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_VOID();		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Other than read-write state will be considered as read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete WAL prohibit state transition.
+ *
+ *	Depending on the final state being transitioned to, the in-memory state
+ *	update is done either before or after emitting the global barrier.
+ *
+ *	The idea behind this is that when we say the system is WAL prohibited,
+ *	WAL writes in all backends must be prohibited; but when the system is no
+ *	longer WAL prohibited, it is not necessary to take all backends out of
+ *	the WAL prohibited state immediately.  There is no harm in letting those
+ *	backends run as read-only a little longer until we emit the barrier,
+ *	since they might have connected while the system was WAL prohibited and
+ *	might be doing read-only work.  Backends that connect from now on can
+ *	immediately start read-write operations.
+ *
+ *	Therefore, when moving the system to a state where WAL is no longer
+ *	prohibited, we update the shared state immediately and emit the barrier
+ *	afterwards.  But when moving the system to the WAL prohibited state, we
+ *	emit the global barrier first, to ensure that no backend writes WAL
+ *	before we set the system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by Checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it must be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information is used to set the final WAL prohibit state on
+	 * restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* Going out of WAL prohibited state then update state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow WAL writes should be done by now */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * The counter should now correspond to the final state where WAL is
+		 * no longer prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush() is not restricted in the WAL prohibited state anyway.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by all backends, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have changed by now.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Only the checkpointer process handles these requests.  The checkpointer
+	 * has to process all pending WAL prohibit state change requests as soon as
+	 * possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes run
+	 * in non-checkpointer processes, do nothing if we are not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process would normally perform
+				 * to bring the server up were skipped; if so, do them right
+				 * away.  While doing that, hold off state transitions to
+				 * avoid a recursive call to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				 */
+				if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+				{
+					HoldWALProhibitStateTransition = true;
+					PerformPendingXLogAcceptWrites();
+					HoldWALProhibitStateTransition = false;
+				}
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let Checkpointer process do anything until
+					 * someone wakes it up.  For example a backend might later
+					 * on request us to put the system back to read-write
+					 * state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 387f80419a5..6141a3d0425 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1ec236352d5..bec829486b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -231,9 +232,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the system is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -515,6 +517,9 @@ typedef enum ExclusiveBackupState
  * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
  * end-of-recovery record but don't need to checkpoint.
  *
+ * RECOVERY_XLOG_WRITE_CHECKPOINT means we need to write a checkpoint.
+ * This is only valid when the checkpointer itself wants a checkpoint.
+ *
  * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need a request that the
  * checkpointer perform a checkpoint. This is only valid when the
  * checkpointer is running.
@@ -523,6 +528,7 @@ typedef enum
 {
 	RECOVERY_XLOG_NOTHING,
 	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_WRITE_CHECKPOINT,
 	RECOVERY_XLOG_REQUEST_CHECKPOINT
 } RecoveryXlogAction;
 
@@ -743,6 +749,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally
+	 * have been performed.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -911,6 +923,7 @@ static bool recoveryApplyDelay(XLogReaderState *record);
 static void SetLatestXTime(TimestampTz xtime);
 static void SetCurrentChunkStartTime(TimestampTz xtime);
 static void CheckRequiredParameterValues(void);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static void XLogReportParameters(void);
 static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 								TimeLineID prevTLI);
@@ -4991,6 +5004,17 @@ SetControlFileDBState(DBState state)
 	LWLockRelease(ControlFileLock);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5267,6 +5291,7 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6428,6 +6453,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch latest state of allow WAL writes.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6793,13 +6827,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8062,8 +8113,29 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Prepare to accept WAL writes. */
-	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which determines whether further WAL insertion is
+	 * allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8115,13 +8187,32 @@ XLogAcceptWrites(RecoveryXlogAction xlogaction,
 				 TimeLineID EndOfLogTLI,
 				 XLogRecPtr EndOfLog)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	XLogCtlInsert *Insert;
+
+	/* Only the startup process, the checkpointer, or a standalone backend may be here. */
+	Assert(AmStartupProcess() || AmCheckpointerProcess() ||
+		   !IsUnderPostmaster);
+
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return;
+
+	/*
+	 * If the system is in the WAL prohibited state, only the checkpointer
+	 * process should be here, completing the operation that was skipped
+	 * earlier when the system was booted in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
+	Insert = &XLogCtl->Insert;
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
@@ -8146,6 +8237,40 @@ XLogAcceptWrites(RecoveryXlogAction xlogaction,
 	 * commit timestamp.
 	 */
 	CompleteCommitTsInitialization();
+
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+}
+
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	ResetLocalXLogInsertAllowed();
+
+	/*
+	 * The EndOfLogTLI and EndOfLog inputs to XLogAcceptWrites() are needed
+	 * only when archive recovery was requested, and we never get here in that
+	 * case: if archive recovery is requested, the system is taken out of the
+	 * WAL prohibited state and the XLogAcceptWrites() operation is never
+	 * skipped.
+	 */
+	XLogAcceptWrites(DetermineRecoveryXlogAction(), 0, InvalidXLogRecPtr);
+
+	/*
+	 * We need to update DBState explicitly, as the startup process does,
+	 * because the end-of-recovery checkpoint sets the DB state to
+	 * shutdown.
+	 */
+	SetControlFileDBState(DB_IN_PRODUCTION);
 }
 
 /*
@@ -8250,6 +8375,12 @@ CheckRecoveryConsistency(void)
 static RecoveryXlogAction
 DetermineRecoveryXlogAction(void)
 {
+	/*
+	 * Write a checkpoint straight away if this is the checkpointer process.
+	 */
+	if (AmCheckpointerProcess())
+		return RECOVERY_XLOG_WRITE_CHECKPOINT;
+
 	/* No REDO, hence no action required. */
 	if (!InRecovery)
 		return RECOVERY_XLOG_NOTHING;
@@ -8295,6 +8426,11 @@ PerformRecoveryXLogAction(RecoveryXlogAction action)
 			CreateEndOfRecoveryRecord();
 			break;
 
+		case RECOVERY_XLOG_WRITE_CHECKPOINT:
+			/* Full checkpoint, when the checkpointer itself calls this. */
+			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+			break;
+
 		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
 			/* Full checkpoint, when checkpointer is running. */
 			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
@@ -8421,9 +8557,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8442,9 +8578,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8466,6 +8613,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8755,9 +8908,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The shutdown checkpoint and xlog-file rotation are performed only if
+	 * WAL writing is permitted; during recovery, a restartpoint is created
+	 * instead.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8770,6 +8927,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -9019,8 +9179,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a416e94d371..0934478188e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -699,6 +699,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 3b3df8fa8cc..9ea299adbc1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,13 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, the system must not be read-only, i.e. WAL
+		 * writes must be permitted.  Second, we need to make sure that there
+		 * is a worker slot available.  Third, we need to make sure that no
+		 * other worker failed
+		 * while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..d3e3e156686 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -348,6 +349,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -692,6 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1341,3 +1346,18 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows a process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 13f3926ff67..1605e8bc26c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -247,6 +248,11 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up WAL prohibit shared state
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 364654e1060..c5d8edd82bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 4a2ed414b00..06f8c9569f0 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+		 * As in ProcessSyncRequests, we don't want to hold up WAL prohibit state change
 		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to hold up WAL prohibit state change requests for a
+		 * long time when there are many fsync requests to be processed; they
+		 * need to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check for WAL prohibit state change requests here too, for
+				 * the same reason mentioned previously.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 27fbf1f3aae..083555aedfe 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ef7e6bfb779..85a12fc580a 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,12 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE:
+			event_name = "SystemWALProhibitState";
+			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c339acf0670..24923f7e44e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -234,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -675,6 +677,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2117,6 +2120,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether the WAL is prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12548,4 +12563,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ff77a68552c
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,59 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4f8b3e31ab7..b16e2682e75 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -134,6 +134,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of work that enables wal writes */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped wal writes */
+	XLOG_ACCEPT_WRITES_DONE			/* wal writes are enabled */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -282,6 +290,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -290,6 +299,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
@@ -301,7 +311,9 @@ extern void XLOGShmemInit(void);
 extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void SetControlFileDBState(DBState state);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL insertion is prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532ec..390786caf7f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11645,6 +11645,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'void',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6007827b445..43e826ceeb3 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -225,7 +225,9 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 423780652fb..5d682883b44 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2826,6 +2826,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v33-0007-Documentation.patchapplication/x-patch; name=v33-0007-Documentation.patchDownload
From 7f4a97deaa50a12f6f1214b96c38ec7332852c02 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v33 7/8] Documentation.

---
 doc/src/sgml/func.sgml              | 26 +++++++++++--
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 4 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbeb..0ea08ee91ac 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25224,9 +25224,9 @@ SELECT collation for ('foo' COLLATE "de_DE");
    </para>
 
    <para>
-    Each of these functions returns <literal>true</literal> if
-    the signal was successfully sent and <literal>false</literal>
-    if sending the signal failed.
+    Except for <function>pg_prohibit_wal</function>, each of these functions
+    returns <literal>true</literal> if the signal was successfully sent
+    and <literal>false</literal> if sending the signal failed.
    </para>
 
    <table id="functions-admin-signal-table">
@@ -25343,6 +25343,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument to alter the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately. When
+        <literal>true</literal> is passed, the system is changed to the WAL
+        prohibited state, in which WAL writes are not allowed, if it is not
+        in that state already. When <literal>false</literal> is passed, the
+        system is changed to the WAL permitted state, in which WAL writes
+        are allowed, if it is not in that state already. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c43f2140205..98b660941b1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    WAL prohibited mode, where insertion of write-ahead log records is
+    prohibited until the same function is executed to change the state back to
+    read-write. As in Hot Standby, connections to the server are allowed to
+    run read-only queries in the WAL prohibited state. If the system is in the
+    WAL prohibited state, the <literal>wal_prohibited</literal> GUC reports
+    <literal>on</literal>; otherwise, it reports <literal>off</literal>.
+    When the WAL prohibited state is requested, any existing session whose
+    transaction has already performed, or is planning to perform, WAL write
+    operations is terminated. This is useful for HA setups where the master
+    server needs to stop accepting WAL writes immediately and kick out any
+    transaction expecting WAL writes at the end, in case of network down on
+    master or replication connection failures.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint; at
+    restart, the server goes through crash recovery and stays in that state
+    until the system is changed back to read-write.  If a WAL prohibited
+    server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system is
+    implicitly taken out of the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because the system was forced into the WAL prohibited state by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere to stop us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, since any error there escalates to PANIC as mentioned
+previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed by the user at any time to stop WAL writing.
+Any backend that receives the WAL prohibited state transition barrier interrupt
+must stop WAL writing immediately.  To absorb the barrier, a backend kills its
+running transaction if that transaction has a valid XID, which indicates that
+it has performed and/or is planning WAL writes.  Transactions that have not yet
+acquired a valid XID, and operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that do not necessarily have a valid XID when writing WAL, are not
+stopped by barrier processing; those might instead hit an error from
+XLogBeginInsert() when trying to write WAL in the WAL prohibited state.  To
+prevent such an error from XLogBeginInsert() inside a critical section, WAL
+write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag that indicates
+whether permission has been checked before calling XLogBeginInsert().  If it
+has not, XLogBeginInsert() will fail an assertion.  The WAL permission check is
+not mandatory if XLogBeginInsert() is not inside a critical section, where
+throwing an error is acceptable.  To get the permission check flag set, either
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+on exit from the critical section.  The rules for placing the permission check
+routines are:
+
+	Places where a WAL write inside a critical section can be expected
+	without a valid XID (e.g. VACUUM) need to be protected by
+	CheckWALPermitted(), so that the error can be reported before entering
+	the critical section.
+
+	Places where INSERT and UPDATE are expected, which never happen without
+	a valid XID, can be checked using AssertWALPermittedHaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and
+	that may or may not have an XID, but where we still want to ensure on
+	assert-enabled builds that permission has been checked, should use
+	AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

Attachment: v33-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 60c3d8c5c97b9423a5c703ed00ae625a7ac02c62 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v33 6/8] Error or Assert before START_CRIT_SECTION for WAL
 write

An ERROR or an Assert is added before START_CRIT_SECTION for WAL writes, based
on the following criteria, for when the system is WAL prohibited:

 - Add an ERROR in functions that can be reached without a valid XID, as in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Add an Assert in functions that cannot be reached without a valid XID; the
   Assert also verifies XID validity.  For that, AssertWALPermitted_HaveXID()
   is added.

To enforce the rule that one of these checks precedes a critical section that
writes WAL, a new assert-only flag, walpermit_checked_state, is added.  If the
check is missing, XLogBeginInsert() fails an assertion when it is called inside
a critical section.

If the WAL insert is not inside a critical section, the above check is not
necessary; we can rely on XLogBeginInsert() itself to perform the check and
report an error.
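
For a quick visual of how the two criteria map onto call sites (an editorial
sketch mirroring hunks in the diff below, not additional code in the patch):

    /* 1. Reachable without an XID, e.g. from VACUUM: report an ERROR up
     *    front, and only when the relation is WAL-logged at all. */
    if (RelationNeedsWAL(rel))
        CheckWALPermitted();
    START_CRIT_SECTION();
    /* ... modify buffers, optionally emit WAL ... */
    END_CRIT_SECTION();

    /* 2. Never reached without a valid XID, e.g. heap_insert(): the
     *    assert-only variant costs nothing in non-assert builds. */
    AssertWALPermittedHaveXID();
    START_CRIT_SECTION();
    /* ... modify buffers and emit WAL ... */
    END_CRIT_SECTION();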
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 +++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 +++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 21 ++++++++--
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/postmaster/checkpointer.c     |  4 ++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 47 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 +++++++++++++
 39 files changed, 507 insertions(+), 68 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..8c672770e79 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index fbccf3d038d..de4b355b8f5 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
+			CheckWALPermitted();
 			computeLeafRecompressWALData(leaf);
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..53c156018e7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
+			CheckWALPermitted();
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..79312e5d2d0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6d2d71be32b..7b321c69880 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..3bb20b787ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Building indexes will have an XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index eb3810494f2..a47a3dd84cc 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index fe9f0df20b1..4ea7b1c934f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b312af57e11..95e0986130b 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
+		CheckWALPermitted();
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
+						CheckWALPermitted();
 						XLogEnsureRecordSpace(0, 3 + nitups);
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 159646c7c3e..d1989e93b35 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2433998f39b..2945ea4b6ba 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..0cb9adf8b5d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write WAL if it is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9eaf07649e8..c6809f5a3e5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1338,6 +1339,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1353,8 +1359,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1958,8 +1963,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1984,7 +1994,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2417,6 +2427,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2427,6 +2438,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2457,7 +2471,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..b532b522275 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -270,6 +272,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * We can reach here from VACUUM or from the startup process, so we need
+	 * not have an XID.
+	 *
+	 * Recovery in the startup process is never in the WAL prohibited state, so
+	 * skip the runtime permission check when we get here from there.
+	 */
+	if (needwal)
+		InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
@@ -277,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +486,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +500,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +534,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 271994b08df..99466b5a5a9 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -229,6 +230,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6ac205c98ee..d1a51864aae 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c3..1519e4d233d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2212,6 +2215,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2310,6 +2316,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index d7f8ffaa09c..9c77175090f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce the rule that WAL insert permission must be checked
+ * before starting a critical section that writes WAL.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6141a3d0425..72c14cc3e9f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bec829486b9..910e6f751a7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1056,7 +1056,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2907,9 +2907,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9431,6 +9433,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9596,6 +9601,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10257,7 +10264,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10271,10 +10278,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10296,8 +10303,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index e596a0470a9..53f1d9948db 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -138,9 +139,20 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
-	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
+	/*
+	 * Cross-check on whether we should be here or not.
+	 *
+	 * This check matters primarily for code paths that are not in a critical
+	 * section and therefore have not already made the permission check above.
+	 */
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -222,6 +234,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset walpermit_checked flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d3e3e156686..b81187bf880 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -928,6 +928,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bc1753ae916..4d8cf5d1651 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3879,13 +3879,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ff77a68552c..a4245aabe5c 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -56,4 +57,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reaches when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in the WAL prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In opposite to the above assertion if a transaction doesn't have valid XID
+ * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
+ * prohibited.  Therefore, we need to explicitly error out before entering into
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e2e9a364a7..6d137bb007b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset the walpermit_checked flag when no longer in a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

Attachment: v33-0003-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 256f7beb0a5c9977cb74fe7dc50fb9a79bc953bc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 15:37:53 -0400
Subject: [PATCH v33 3/8] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.
---
 src/backend/access/transam/xlog.c | 75 +++++++++++++++++++------------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1aa35c5644b..453eb2700b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -948,6 +948,9 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static void XLogAcceptWrites(RecoveryXlogAction xlogaction,
+							 TimeLineID EndOfLogTLI,
+							 XLogRecPtr EndOfLog);
 static RecoveryXlogAction DetermineRecoveryXlogAction(void);
 static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
@@ -8047,35 +8050,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	XLogAcceptWrites(xlogaction, EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8119,6 +8095,47 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static void
+XLogAcceptWrites(RecoveryXlogAction xlogaction,
+				 TimeLineID EndOfLogTLI,
+				 XLogRecPtr EndOfLog)
+{
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

Attachment: v33-0002-Postpone-some-end-of-recovery-operations-relatin.patch (application/x-patch)
From edf016dbc29181518ec1d51140e4a634c7594e0b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH v33 2/8] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, we issued XLOG_FPW_CHANGE and either
XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY while still
technically in recovery, and also performed post-archive-recovery
cleanup steps at that point. Postpone that stuff until after we clear
InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.
---
 src/backend/access/transam/xlog.c | 34 ++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index faeebd74950..1aa35c5644b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7990,22 +7990,11 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Figure out what xlog activity is needed to mark end of recovery. We
+	 * must make this determination before setting InRecovery = false, or
+	 * we'll get the wrong answer.
 	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	xlogaction = DetermineRecoveryXlogAction();
-	PerformRecoveryXLogAction(xlogaction);
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8058,6 +8047,23 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	PerformRecoveryXLogAction(xlogaction);
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

Attachment: v33-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch (application/x-patch)
From a6d84d7f6ee19910c1a043a52bf17f6b1c1d67d8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v33 1/8] Refactor some end-of-recovery code out of
 StartupXLOG().

Split the code that handles writing either a checkpoint or an
end-of-recovery record into DetermineRecoveryXlogAction(), which
decides what to do, and PerformRecoveryXLogAction(), which does it.
Right now these are always called one after the other, but further
refactoring is planned which will separate them.

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.
---
 src/backend/access/transam/xlog.c | 303 ++++++++++++++++++------------
 1 file changed, 182 insertions(+), 121 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749da..faeebd74950 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -506,6 +506,27 @@ typedef enum ExclusiveBackupState
 	EXCLUSIVE_BACKUP_STOPPING
 } ExclusiveBackupState;
 
+/*
+ * What should we do when we reach the end of REDO to ensure that we'll
+ * be able to recover properly if we crash again?
+ *
+ * RECOVERY_XLOG_NOTHING means we didn't actually REDO anything and therefore
+ * no action is required.
+ *
+ * RECOVERY_XLOG_WRITE_END_OF_RECOVERY means we need to write an
+ * end-of-recovery record but don't need to checkpoint.
+ *
+ * RECOVERY_XLOG_REQUEST_CHECKPOINT means we need to request that the
+ * checkpointer perform a checkpoint. This is only valid when the
+ * checkpointer is running.
+ */
+typedef enum
+{
+	RECOVERY_XLOG_NOTHING,
+	RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
+	RECOVERY_XLOG_REQUEST_CHECKPOINT
+} RecoveryXlogAction;
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -880,6 +901,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -925,6 +948,8 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static RecoveryXlogAction DetermineRecoveryXlogAction(void);
+static void PerformRecoveryXLogAction(RecoveryXlogAction action);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5694,6 +5719,97 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * Archive recovery can only be handled by the startup process or by a
+	 * standalone (single-user) backend.
+	 */
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old
+	 * timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline,
+	 * remove them. They might contain valid WAL, but they might also be
+	 * pre-allocated files containing garbage. In any case, they are not
+	 * part of the new timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with
+	 * the last, partial segment on the old timeline? If we don't archive
+	 * it, and the server that created the WAL never archives it either
+	 * (e.g. because it was hit by a meteor), it will never make it to the
+	 * archive. That's OK from our point of view, because the new segment
+	 * that we created with the new TLI contains all the WAL from the old
+	 * timeline up to the switch point. But if you later try to do PITR to
+	 * the "missing" WAL on the old timeline, recovery won't find it in
+	 * the archive. It's physically present in the new file with new TLI,
+	 * but recovery won't look there when it's recovering to the older
+	 * timeline. On the other hand, if we archive the partial segment, and
+	 * the original server on that timeline is still running and archives
+	 * the completed version of the same segment later, it will fail. (We
+	 * used to do that in 9.4 and below, and it caused such problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial
+	 * suffix, and archive it. Archive recovery will never try to read
+	 * .partial segments, so they will normally go unused. But in the odd
+	 * PITR case, the administrator can copy them manually to the pg_wal
+	 * directory (removing the suffix). They can be useful in debugging,
+	 * too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline,
+	 * however, we had already determined that the segment is complete, so
+	 * we can let it be archived normally. (In particular, if it was
+	 * restored from the archive to begin with, it's expected to have a
+	 * .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -6512,7 +6628,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	RecoveryXlogAction xlogaction;
 	struct stat st;
 
 	/*
@@ -7883,127 +7999,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
+	xlogaction = DetermineRecoveryXlogAction();
+	PerformRecoveryXLogAction(xlogaction);
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8107,7 +8109,7 @@ StartupXLOG(void)
 	 * and in case of a crash, recovering from it might take a longer than is
 	 * appropriate now that we're not in standby mode anymore.
 	 */
-	if (promoted)
+	if (xlogaction == RECOVERY_XLOG_WRITE_END_OF_RECOVERY)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8207,6 +8209,65 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Determine what needs to be done upon completing REDO.
+ */
+static RecoveryXlogAction
+DetermineRecoveryXlogAction(void)
+{
+	/* No REDO, hence no action required. */
+	if (!InRecovery)
+		return RECOVERY_XLOG_NOTHING;
+
+	/*
+	 * In promotion, only create a lightweight end-of-recovery record instead
+	 * of a full checkpoint. A checkpoint is requested later, after we're
+	 * fully out of recovery mode and already accepting WAL writes.
+	 */
+	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		LocalPromoteIsTriggered)
+		return RECOVERY_XLOG_WRITE_END_OF_RECOVERY;
+
+	/*
+	 * We decided against writing only an end-of-recovery record, and we know
+	 * that the postmaster was told to launch the checkpointer, so just
+	 * request a checkpoint.
+	 */
+	return RECOVERY_XLOG_REQUEST_CHECKPOINT;
+}
+
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static void
+PerformRecoveryXLogAction(RecoveryXlogAction action)
+{
+	switch (action)
+	{
+		case RECOVERY_XLOG_NOTHING:
+			/* No REDO performed, hence nothing to do. */
+			break;
+
+		case RECOVERY_XLOG_WRITE_END_OF_RECOVERY:
+			/* Lightweight end-of-recovery record in lieu of checkpoint. */
+			CreateEndOfRecoveryRecord();
+			break;
+
+		case RECOVERY_XLOG_REQUEST_CHECKPOINT:
+			/* Full checkpoint, when checkpointer is running. */
+			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+							  CHECKPOINT_IMMEDIATE |
+							  CHECKPOINT_WAIT);
+	}
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

#149Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#148)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:

Here is the rebased version.

v33-0004

This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.

v33-0005

This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the value of LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that the function is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integer to bool?
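
For reference, the shape of the pre-patch coding that rule comes from is roughly this (paraphrased from memory, not the exact source; the point is only the int-to-bool coercion):

	static int	LocalXLogInsertAllowed = -1;	/* -1 = unknown, 0 = no, 1 = yes */

	bool
	XLogInsertAllowed(void)
	{
		if (LocalXLogInsertAllowed >= 0)
			return (bool) LocalXLogInsertAllowed;	/* the coercion in question */

		if (RecoveryInProgress())
			return false;

		/* Recovery is over, so cache "yes" and stop re-checking. */
		LocalXLogInsertAllowed = 1;
		return true;
	}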

The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted".

The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process".

The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up wal probibit shared state"

The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib.

tcop/utility.c needlessly includes "access/walprohibit.h"

wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different?

xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values:

/* State of work that enables wal writes */
typedef enum XLogAcceptWritesState
{
XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */
XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */
} XLogAcceptWritesState;

This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for.

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new function comment says:

/*
* In opposite to the above assertion if a transaction doesn't have valid XID
* (e.g. VACUUM) then it won't be killed while changing the system state to WAL
* prohibited. Therefore, we need to explicitly error out before entering into
* the critical section.
*/

This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?

v33-0007:

I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a boolean? If you believe it must always succeed, you can still define it as returning a boolean and always return true. That leaves the door open to future code changes which might need to return false for some reason.

But I also don't like the idea that existing transactions with xids are immediately killed. Shouldn't this function take an optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into WALPROHIBIT_STATE_GOING_READ_ONLY for a period of time before killing remaining transactions?

Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.

That's enough code review for now. Next I will review your regression tests....


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#150Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#149)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Sep 9, 2021 at 1:42 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

Honestly that sentence doesn't sound very clear even with a different verb.

This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?

The idea here is that if a transaction already has an XID assigned, we
have to kill it off before we can declare the system read-only,
because it will definitely write WAL when the transaction ends: either
a commit record, or an abort record, but definitely something. So
cases where we write WAL without necessarily having an XID require
special handling. They have to check whether WAL has become prohibited
and error out if so, and they need to do so before entering the
critical section - because if the problem were detected for the first
time inside the critical section it would escalate to a PANIC, which
we do not want. Places where we're guaranteed to have an XID - e.g.
inserting a heap tuple - don't need a run-time check before entering
the critical section, because the code can't be reached in the first
place if the system is WAL-read-only.
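
To make the rule concrete, call sites end up looking roughly like this (a sketch only, not taken from any particular caller in the patch):

	/* Path that may not have an XID (e.g. VACUUM pruning a page): */
	bool		needwal = RelationNeedsWAL(rel);

	if (needwal)
		CheckWALPermitted();	/* may ERROR, safely outside the critical section */

	START_CRIT_SECTION();
	/* ... modify the buffer, then XLogInsert() if needwal ... */
	END_CRIT_SECTION();

	/* Path reachable only with an XID assigned (e.g. a heap insert): */
	AssertWALPermittedHaveXID();	/* assertion only; no run-time check needed */

	START_CRIT_SECTION();
	/* ... */
	END_CRIT_SECTION();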

Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.

Hmm, I find pg_prohibit_wal(true/false) better than pg_prohibit_wal()
and pg_allow_wal(), and would prefer pg_prohibit_wal(true/false,
timeout) over pg_prohibit_wal(timeout) and pg_allow_wal(), because I
think then once you find that one function you know how to do
everything about that feature, whereas the other way you need to find
both functions to have the whole story. That said, I can see why
somebody else might prefer something else.

--
Robert Haas
EDB: http://www.enterprisedb.com

#151Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#150)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 9, 2021, at 11:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:

They have to check whether WAL has become prohibited
and error out if so, and they need to do so before entering the
critical section - because if the problem were detected for the first
time inside the critical section it would escalate to a PANIC, which
we do not want.

But that is the part that is still not clear. Should the comment say that a concurrent change to prohibit wal after the current process checks but before the current process exits the critical section will result in a panic? What is unclear about the comment is that it implies that a check before the critical section is sufficient, but ordinarily one would expect a lock to be held and the check-and-lock dance to carefully avoid any race condition. If somehow this is safe, the logic for why it is safe should be spelled out. If not, a mea culpa saying, "hey, we're not terribly safe about this" should be explicit in the comment.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#152Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#149)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Thank you for looking at the patch. Please see my replies inline below:

On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:

Here is the rebased version.

v33-0004

This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.

Yes, you are correct, xlog.h is included in more than 150 files. I was
wondering whether we could use a forward declaration instead of
including pg_control.h (the same way struct XLogRecData is declared in
xlog.h). DBState is an enum, though, and I don't see us forward-declaring
enums elsewhere the way we do structures, but that seems fine to me.

I was unsure about this before preparing the patch, but since it makes
sense and minimizes duplication, shall we go ahead and separately post
the same change for StartupXLOG(), which I have skipped for the reason
mentioned in the patch commit message?

v33-0005

This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the value of LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that the function is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integer to bool?

We can't use a boolean variable because LocalXLogInsertAllowed
represents three states: 1 means "WAL is allowed", 0 means "WAL is
disallowed", and -1 means "need to check".

The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted".

The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process".

The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up wal probibit shared state"

The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib.

tcop/utility.c needlessly includes "access/walprohibit.h"

wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different?

xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values:

/* State of work that enables wal writes */
typedef enum XLogAcceptWritesState
{
XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */
XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */
} XLogAcceptWritesState;

This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for.

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

I will think about how to improve the code comments you pointed out.

The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new function comment says:

CheckWALPermitted() calls XLogInsertAllowed(), which checks the
LocalXLogInsertAllowed flag; that flag is local to the process, and
nobody else reads it concurrently.

/*
* In opposite to the above assertion if a transaction doesn't have valid XID
* (e.g. VACUUM) then it won't be killed while changing the system state to WAL
* prohibited. Therefore, we need to explicitly error out before entering into
* the critical section.
*/

This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?

Hm, interrupt absorption is disabled inside a critical section. The
WAL prohibited state for that process (here, vacuum) can never take
effect until the process absorbs the interrupt, and the system is not
considered WAL prohibited until every process has absorbed it. I am
not sure we should explain the characteristics of critical sections at
this place; if you want, we can add a brief note saying that inside a
critical section we need not worry about the state change, because it
can never happen there while interrupts are disabled.
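
For reference, the relevant gate looks roughly like this (a simplified paraphrase of CHECK_FOR_INTERRUPTS() and ProcessInterrupts(), not the exact source):

	#define CHECK_FOR_INTERRUPTS() \
	do { \
		if (INTERRUPTS_PENDING_CONDITION()) \
			ProcessInterrupts(); \
	} while (0)

	void
	ProcessInterrupts(void)
	{
		/* Nothing is absorbed while interrupts are held off or in a critical section. */
		if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
			return;

		/*
		 * Barrier absorption, and with it any WAL prohibit state change for
		 * this backend, only happens past this point (ProcessProcSignalBarrier(),
		 * ProcDiePending handling, and so on).
		 */
	}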

v33-0007:

I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a boolean? If you believe it must always succeed, you can still define it as returning a boolean and always return true. That leaves the door open to future code changes which might need to return false for some reason.

OK, I am fine with always returning true.

But I also don't like the idea that existing transactions with xids are immediately killed. Shouldn't this function take an optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into WALPROHIBIT_STATE_GOING_READ_ONLY for a period of time before killing remaining transactions?

Ok, will check.

Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.

Like Robert, I too am inclined toward a single function that is easy
to remember. Apart from this, while recently testing this patch with
pgbench I exhausted the connection limit and wanted to change the
system's WAL prohibited state in the middle of the run, but I was
unable to do so; I wished I could do that via pg_ctl. How about having
a pg_ctl option to alter the WAL prohibited state?

That's enough code review for now. Next I will review your regression tests....

Thanks again.

#153Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#152)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 10, 2021, at 7:36 AM, Amul Sul <sulamul@gmail.com> wrote:

v33-0005

This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the value of LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that the function is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integer to bool?

We can't use a boolean variable because LocalXLogInsertAllowed
represents three states: 1 means "wal is allowed", 0 means "wal is
disallowed", and -1 means "need to check".

I'm complaining that we're using an integer rather than an enum. I'm ok if we define it so that WAL_ALLOWABLE_UNKNOWN = -1, WAL_DISALLOWED = 0, WAL_ALLOWED = 1 or such, but the logic of the function has gotten complicated enough that having to remember which number represents which logical condition has become a (small) mental burden. Given how hard the WAL code is to read and fully grok, I'd rather avoid any unnecessary burden, even small ones.
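
To make that concrete, a minimal sketch of what I have in mind, using the names above (purely illustrative, not taken from your patch):

typedef enum WalInsertAllowedState
{
	WAL_ALLOWABLE_UNKNOWN = -1,	/* must consult RecoveryInProgress() etc. */
	WAL_DISALLOWED = 0,
	WAL_ALLOWED = 1
} WalInsertAllowedState;

static WalInsertAllowedState LocalXLogInsertAllowed = WAL_ALLOWABLE_UNKNOWN;

That way XLogInsertAllowed() can switch on named states instead of magic integers.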

The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new function comment says:

CheckWALPermitted() calls XLogInsertAllowed(), which does check the
LocalXLogInsertAllowed flag, but that flag is local to the process and
nobody else reads it concurrently.

/*
* In opposite to the above assertion if a transaction doesn't have valid XID
* (e.g. VACUUM) then it won't be killed while changing the system state to WAL
* prohibited. Therefore, we need to explicitly error out before entering into
* the critical section.
*/

This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?

Hm, interrupt absorption is disabled inside the critical section. The
WAL-prohibited state for that process (here vacuum) will never take
effect until the process services the interrupt, and the system is not
considered WAL prohibited until every process has seen it. I am not
sure we should explain the characteristics of the critical section at
this place; if we want, we can add a brief note saying that inside the
critical section we need not worry about the state changing, because
interrupts are disabled there so it can never happen.

I think the fact that interrupts are disabled during critical sections is understood, so there is no need to mention that. The problem is that the method for taking the system read-only is less generally known, and readers of other sections of code need to jump to the definition of CheckWALPermitted to read the comments and understand what it does. Take for example a code stanza from heapam.c:

if (needwal)
CheckWALPermitted();

/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();

Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that an interrupt might get processed *during* CheckWALPermitted, since that function isn't atomic. It might happen after the check is meaningfully finished but before the function actually returns. So I'm not inclined to believe that the way this all works is dependent on interrupts being blocked. So I think, maybe this is all protected by some other scheme. But what? It's not clear from the code comments for CheckWALPermitted, so I'm left having to reverse engineer the system to understand it.

One interpretation is that the signal handler will exit() my backend if it receives a signal saying that the system is going read-only, so there is no race condition. But then why the call to CheckWALPermitted()? If this interpretation were correct, we'd happily enter the critical section without checking, secure in the knowledge that as long as we haven't exited yet, all is ok.

Another interpretation is that the whole thing is just a performance trick. Maybe we're ok with the idea that we will occasionally miss the fact that wal is prohibited, do whatever work we need in the critical section, and then fail later. But if that is true, it had better not be a panic, because designing the system to panic 1% of the time (or whatever percent it works out to be) isn't project style. So looking into the critical section in the heapam.c code, I see:

XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);

XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);

And jumping to the definition of XLogBeginInsert() I see

/*
* WAL permission must have checked before entering the critical section.
* Otherwise, WAL prohibited error will force system panic.
*/

So now I'm flummoxed. Is it that the code is broken, or is it that I don't know what the strategy behind all this is? If there were a code comment saying how this all works, I'd be in a better position to either know that it is truly safe or alternately know that the strategy is wrong.

Even if my analysis that this is all flawed is incorrect, I still think that a code comment would help.

v33-0007:

I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a boolean? If you believe it must always succeed, you can still define it as returning a boolean and always return true. That leaves the door open to future code changes which might need to return false for some reason.

Ok, I am fine to always return true.

Ok.

But I also don't like the idea that existing transactions with xids are immediately killed. Shouldn't this function take an optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into WALPROHIBIT_STATE_GOING_READ_ONLY for a period of time before killing remaining transactions?

Ok, will check.

Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.

Like Robert, I too am inclined to have a single function that is easy
to remember.

For C language functions that take a bool argument, I can jump to the definition using ctags, and I assume most other developers can do so using whatever IDE they like. For SQL functions, it's a bit harder to jump to the definition, particularly if you are logged into a production server where non-essential software is intentionally missing. Then you have to wonder, what exactly is the boolean argument toggling here?

I don't feel strongly about this, though, and you don't need to change it.

Apart from this, recently while testing this patch with pgbench I
exhausted the connection limit and wanted to change the system's
WAL-prohibited state in between, but I was unable to do that; I wished
I could do it using a pg_ctl option. How about having a pg_ctl option
to alter the WAL-prohibited state?

I'd have to review the implementation, but sure, that sounds like a useful ability.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#154Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#153)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 10, 2021, at 8:42 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Take for example a code stanza from heapam.c:

if (needwal)
CheckWALPermitted();

/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();

Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that an interrupt might get processed *during* CheckWALPermitted, since that function isn't atomic.

A better example may be found in ginmetapage.c:

needwal = RelationNeedsWAL(indexrel);
if (needwal)
{
CheckWALPermitted();
computeLeafRecompressWALData(leaf);
}

/* Apply changes to page */
START_CRIT_SECTION();

Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument can't be made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is running.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#155Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#154)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Sep 10, 2021 at 12:20 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

A better example may be found in ginmetapage.c:

needwal = RelationNeedsWAL(indexrel);
if (needwal)
{
CheckWALPermitted();
computeLeafRecompressWALData(leaf);
}

/* Apply changes to page */
START_CRIT_SECTION();

Yeah, that looks sketchy. Why not move CheckWALPermitted() down a line?
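
Something like this, I mean (just a sketch of the reordering, based on the stanza quoted above):

needwal = RelationNeedsWAL(indexrel);
if (needwal)
{
	computeLeafRecompressWALData(leaf);

	/* check as late as possible, right before entering the critical section */
	CheckWALPermitted();
}

/* Apply changes to page */
START_CRIT_SECTION();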

Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument can't be made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is running.

I think the relevant question here is not "could a signal handler
fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
question is the former, then there's no hope of ever making it work
because there's always a race condition. But the signal handler is
only setting flags whose only effect is to make a subsequent
CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
ProcessInterrupts().
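
To spell the pattern out (a simplified sketch; the handler name here is
invented, but this is roughly what the existing SIGUSR1 procsignal
handling already does):

/* Signal handler: only record that something needs attention. */
static void
handle_barrier_signal(SIGNAL_ARGS)
{
	ProcSignalBarrierPending = true;	/* absorbed later */
	InterruptPending = true;
	SetLatch(MyLatch);
}

/* Much later, at a point the backend itself chooses: */
CHECK_FOR_INTERRUPTS();		/* barrier is actually absorbed here, via ProcessInterrupts() */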

--
Robert Haas
EDB: http://www.enterprisedb.com

#156Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#155)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 10, 2021, at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think the relevant question here is not "could a signal handler
fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
question is the former, then there's no hope of ever making it work
because there's always a race condition. But the signal handler is
only setting flags whose only effect is to make a subsequent
CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
ProcessInterrupts().

Ok, that makes more sense. I was reviewing the code after first reviewing the documentation changes, which led me to believe the system was designed to respond more quickly than that:

+    WAL prohibited is a read-only system state. Any permitted user can call
+    <function>pg_prohibit_wal</function> function to forces the system into
+    a WAL prohibited mode where insert write ahead log will be prohibited until
+    the same function executed to change that state to read-write. Like Hot

and

+    Otherwise, it will be <literal>off</literal>.  When the user requests WAL
+    prohibited state, at that moment if any existing session is already running
+    a transaction, and that transaction has already been performed or planning
+    to perform wal write operations then the session running that transaction
+    will be terminated.

"forces the system" in the first part, and "at that moment ... that transaction will be terminated" sounds heavier handed than something which merely sets a flag asking the backend to exit. I was reading that as more immediate and then trying to figure out how the signal handling could possibly work, and failing to see how.

The README:

+Any
+backends which receive WAL prohibited system state transition barrier interrupt
+need to stop WAL writing immediately.  For barrier absorption the backed(s) will
+kill the running transaction which has valid XID indicates that the transaction
+has performed and/or planning WAL write.

uses "immediately" and "will kill the running transaction" which reenforced the impression that this mechanism is heavier handed than it is.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#157Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#156)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Sep 10, 2021 at 1:16 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

uses "immediately" and "will kill the running transaction" which reenforced the impression that this mechanism is heavier handed than it is.

It's intended to be just as immediate as e.g. pg_cancel_backend() and
pg_terminate_backend(), which work just the same way, but not any more
so. I guess we could look at how things are worded in those cases.
From a user perspective such things are usually pretty immediate, but
not as immediate as firing a signal handler. Computers are fast. [1]

--
Robert Haas
EDB: http://www.enterprisedb.com

[1]: https://www.youtube.com/watch?v=6xijhqU8r2A

#158Mark Dilger
mark.dilger@enterprisedb.com
In reply to: amul sul (#1)
2 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Jun 16, 2020, at 6:55 AM, amul sul <sulamul@gmail.com> wrote:

(2) if the session is idle, we also need the top-level abort
record to be written immediately, but can't send an error to the client until the next
command is issued without losing wire protocol synchronization. For now, we just use
FATAL to kill the session; maybe this can be improved in the future.

Andres,

I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned page held by the cursor, when another session disables WAL. It would be very interesting to test how the vacuum handles that specific change. I have not figured out the cleanest way to do this, though, as we don't as a project yet have a standard way of setting up xid exhaustion in a regression test, do we? The closest I saw to it was your work in [1], but that doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which isn't useable from inside an isolation test. Do you have a suggestion how best to continue developing out the test infrastructure?

Amul,

The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the isolation tester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test to abort and all subsequent permutations to not run. Attached patch v34-0009 fixes that.

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.

Attachments:

v34-0009-Make-isolationtester-handle-closed-sessions.patchapplication/octet-stream; name=v34-0009-Make-isolationtester-handle-closed-sessions.patch; x-unix-mode=0644Download
From b7c21e95c9a48208f1aec377366b26e40cf9ecb4 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 14 Sep 2021 15:18:52 -0700
Subject: [PATCH v34 09/10] Make isolationtester handle closed sessions.

The recent implementation of ALTER SYSTEM READ ONLY should be
testable from src/test/isolation.  For that, the isolation tester
needs to reconnect after getting kicked out rather than aborting.
---
 src/test/isolation/isolationtester.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 88594a3cb5..87cf88cbec 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -85,6 +85,24 @@ disconnect_atexit(void)
 			PQfinish(conns[i].conn);
 }
 
+static void
+restore_connection(PGconn *conn)
+{
+	if (PQstatus(conn) != CONNECTION_BAD)
+		return;
+
+	fprintf(stderr,
+			_("The connection to the server was lost. Attempting reset: "));
+	PQreset(conn);
+	if (PQstatus(conn) == CONNECTION_BAD)
+	{
+		fprintf(stderr, _("Failed.\n"));
+		exit(1);
+	}
+	else
+		fprintf(stderr, _("Succeeded.\n"));
+}
+
 int
 main(int argc, char **argv)
 {
@@ -890,7 +908,7 @@ try_complete_step(TestSpec *testspec, PermutationStep *pstep, int flags)
 					{
 						fprintf(stderr, "PQconsumeInput failed: %s\n",
 								PQerrorMessage(conn));
-						exit(1);
+						restore_connection(conn);
 					}
 					if (!PQisBusy(conn))
 						break;
@@ -964,7 +982,7 @@ try_complete_step(TestSpec *testspec, PermutationStep *pstep, int flags)
 		{
 			fprintf(stderr, "PQconsumeInput failed: %s\n",
 					PQerrorMessage(conn));
-			exit(1);
+			restore_connection(conn);
 		}
 	}
 
@@ -1013,6 +1031,8 @@ try_complete_step(TestSpec *testspec, PermutationStep *pstep, int flags)
 						printf("%s:  %s\n", sev, msg);
 					else
 						printf("%s\n", PQresultErrorMessage(res));
+
+					restore_connection(conn);
 				}
 				break;
 			default:
-- 
2.21.1 (Apple Git-122.3)

v34-0010-Test-ALTER-SYSTEM-READ-ONLY-against-cursors.patchapplication/octet-stream; name=v34-0010-Test-ALTER-SYSTEM-READ-ONLY-against-cursors.patch; x-unix-mode=0644Download
From ce3054a97c78325b06e67219fe0be3535b7be272 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 14 Sep 2021 15:28:09 -0700
Subject: [PATCH v34 10/10] Test ALTER SYSTEM READ ONLY against cursors.

Add an isolation test which checks the impact of setting the system
read-only upon cursors that use a FOR UPDATE query.
---
 .../expected/cursor-prohibit-wal.out          | 157 ++++++++++++++++++
 src/test/isolation/isolation_schedule         |   1 +
 .../isolation/specs/cursor-prohibit-wal.spec  |  38 +++++
 3 files changed, 196 insertions(+)
 create mode 100644 src/test/isolation/expected/cursor-prohibit-wal.out
 create mode 100644 src/test/isolation/specs/cursor-prohibit-wal.spec

diff --git a/src/test/isolation/expected/cursor-prohibit-wal.out b/src/test/isolation/expected/cursor-prohibit-wal.out
new file mode 100644
index 0000000000..6b479143a6
--- /dev/null
+++ b/src/test/isolation/expected/cursor-prohibit-wal.out
@@ -0,0 +1,157 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1a s1b s1c s2a s2b s1d
+step s1a: DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE;
+step s1b: FETCH FORWARD ALL FROM rw_cur;
+   a
+----
+5000
+(1 row)
+
+step s1c: CLOSE ALL;
+step s2a: SELECT pg_prohibit_wal(true);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s2b: SELECT pg_prohibit_wal(false);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+PQconsumeInput failed: FATAL:  WAL is now prohibited
+HINT:  Sessions with open write transactions must be terminated.
+server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+
+The connection to the server was lost. Attempting reset: Succeeded.
+step s1d: ROLLBACK;
+
+starting permutation: s1a s1b s2a s1c s2b s1d
+step s1a: DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE;
+step s1b: FETCH FORWARD ALL FROM rw_cur;
+   a
+----
+5000
+(1 row)
+
+step s2a: SELECT pg_prohibit_wal(true);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s1c: CLOSE ALL;
+FATAL:  WAL is now prohibited
+FATAL:  WAL is now prohibited
+HINT:  Sessions with open write transactions must be terminated.
+server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+
+The connection to the server was lost. Attempting reset: Succeeded.
+step s2b: SELECT pg_prohibit_wal(false);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+s1: WARNING:  there is no transaction in progress
+step s1d: ROLLBACK;
+
+starting permutation: s1a s1b s2a s2b s1c s1d
+step s1a: DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE;
+step s1b: FETCH FORWARD ALL FROM rw_cur;
+   a
+----
+5000
+(1 row)
+
+step s2a: SELECT pg_prohibit_wal(true);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s2b: SELECT pg_prohibit_wal(false);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+PQconsumeInput failed: FATAL:  WAL is now prohibited
+HINT:  Sessions with open write transactions must be terminated.
+server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+
+The connection to the server was lost. Attempting reset: Succeeded.
+step s1c: CLOSE ALL;
+s1: WARNING:  there is no transaction in progress
+step s1d: ROLLBACK;
+
+starting permutation: s1a s2a s1b s1c s2b s1d
+step s1a: DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE;
+step s2a: SELECT pg_prohibit_wal(true);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s1b: FETCH FORWARD ALL FROM rw_cur;
+ERROR:  WAL is now prohibited
+step s1c: CLOSE ALL;
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
+step s2b: SELECT pg_prohibit_wal(false);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s1d: ROLLBACK;
+
+starting permutation: s1a s2a s1b s2b s1c s1d
+step s1a: DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE;
+step s2a: SELECT pg_prohibit_wal(true);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s1b: FETCH FORWARD ALL FROM rw_cur;
+ERROR:  WAL is now prohibited
+step s2b: SELECT pg_prohibit_wal(false);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s1c: CLOSE ALL;
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
+step s1d: ROLLBACK;
+
+starting permutation: s1a s2a s2b s1b s1c s1d
+step s1a: DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE;
+step s2a: SELECT pg_prohibit_wal(true);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s2b: SELECT pg_prohibit_wal(false);
+pg_prohibit_wal
+---------------
+               
+(1 row)
+
+step s1b: FETCH FORWARD ALL FROM rw_cur;
+   a
+----
+5000
+(1 row)
+
+step s1c: CLOSE ALL;
+step s1d: ROLLBACK;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index f4c01006fc..fb79e010b0 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -96,3 +96,4 @@ test: plpgsql-toast
 test: truncate-conflict
 test: serializable-parallel
 test: serializable-parallel-2
+test: cursor-prohibit-wal
diff --git a/src/test/isolation/specs/cursor-prohibit-wal.spec b/src/test/isolation/specs/cursor-prohibit-wal.spec
new file mode 100644
index 0000000000..8f699a4c6e
--- /dev/null
+++ b/src/test/isolation/specs/cursor-prohibit-wal.spec
@@ -0,0 +1,38 @@
+# Test for behavior of a cursor over an index scan when the server is read-only
+# and the cursor moves over index pages that refer to dead tuples.  The key
+# point is that the server is read-only, but the scan of the index page will
+# want to hint the tuples as dead.
+# 
+# The transaction operating the cursor is intentionally read-only in its
+# behavior.  If it were to do anything that generated a transaction ID, we
+# would expect it to be terminated when the server goes read-only.
+#
+
+setup
+{
+	CREATE TABLE tbl (a integer);
+	INSERT INTO tbl SELECT * FROM generate_series(1,10000);
+}
+
+teardown
+{
+	DROP TABLE tbl;
+}
+
+session s1
+setup			{ BEGIN; }
+step s1a		{ DECLARE rw_cur CURSOR FOR SELECT a FROM tbl WHERE a = 5000 FOR UPDATE; }
+step s1b		{ FETCH FORWARD ALL FROM rw_cur; }
+step s1c		{ CLOSE ALL; }
+step s1d		{ ROLLBACK; }
+
+session s2
+step s2a		{ SELECT pg_prohibit_wal(true); }
+step s2b		{ SELECT pg_prohibit_wal(false); }
+
+permutation		s1a s1b s1c s2a s2b s1d
+permutation		s1a s1b s2a s1c s2b s1d
+permutation		s1a s1b s2a s2b s1c s1d
+permutation		s1a s2a s1b s1c s2b s1d
+permutation		s1a s2a s1b s2b s1c s1d
+permutation		s1a s2a s2b s1b s1c s1d
-- 
2.21.1 (Apple Git-122.3)

#159Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#135)
3 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sat, Jul 24, 2021 at 1:33 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 17, 2021 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is rebase for the latest master head. Also, I added one more
refactoring code that deduplicates the code setting database state in the
control file. The same code set the database state is also needed for this
feature.

I started studying 0001 today and found that it rearranged the order
of operations in StartupXLOG() more than I was expecting. It does, as
per previous discussions, move a bunch of things to the place where we
now call XLogParamters(). But, unsatisfyingly, InRecovery = false and
XLogReaderFree() then have to move down even further. Since the goal
here is to get to a situation where we sometimes XLogAcceptWrites()
after InRecovery = false, it didn't seem nice for this refactoring
patch to still end up with a situation where this stuff happens while
InRecovery = true. In fact, with the patch, the amount of code that
runs with InRecovery = true actually *increases*, which is not what I
think should be happening here. That's why the patch ends up having to
adjust SetMultiXactIdLimit to not Assert(!InRecovery).

And then I started to wonder how this was ever going to work as part
of the larger patch set, because as you have it here,
XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
checkpointer is calling that at a later time after the user issues
pg_prohibit_wal(false), it's going to have none of those things. So I
had a quick look at that part of the code and found this in
checkpointer.c:

XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);

For those following along from home, the additional "true" is a bool
needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
this is very satisfying. The whole purpose of passing the xlogreader
is so we can figure out whether we need a checkpoint (never mind the
question of whether the existing algorithm for determining that is
really sensible) but now we need a second argument that basically
serves the same purpose since one of the two callers to this function
won't have an xlogreader. And then we're passing the EndOfLog and
EndOfLogTLI as dummy values which seems like it's probably just
totally wrong, but if for some reason it works correctly there sure
don't seem to be any comments explaining why.

So I started doing a bit of hacking myself and ended up with the
attached, which I think is not completely the right thing yet but I
think it's better than your version. I split this into three parts.
0001 splits up the logic that currently decides whether to write an
end-of-recovery record or a checkpoint record and if the latter how
the checkpoint ought to be performed into two functions.
DetermineRecoveryXlogAction() figures out what we want to do, and
PerformRecoveryXlogAction() does it. It also moves the code to run
recovery_end_command and related stuff into a new function
CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
CleanupAfterArchiveRecovery() to just before we
XLogReportParameters(). Because of the refactoring done by 0001, this
is only a small amount of code movement. Because of the separation
between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
the latter doesn't need the xlogreader. So we can do
DetermineRecoveryXlogAction() at the same time as now, while the
xlogreader is available, and then we don't need it later when we
PerformRecoveryXlogAction(), because we already know what we need to
know. I think this is all fine as far as it goes.

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
seem too problematic, because that value has been stored in four (!)
places inside XLogCtl by this code:

LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

XLogCtl->LogwrtResult = LogwrtResult;

XLogCtl->LogwrtRqst.Write = EndOfLog;
XLogCtl->LogwrtRqst.Flush = EndOfLog;

Presumably we could relatively easily change things around so that we
fish one of those values ... probably one of the "write" values ...
back out of XLogCtl instead of passing it as a parameter. That would
work just as well from the checkpointer as from the startup process,
and there seems to be no way for the value to change until after
XLogAcceptWrites() has been called, so it seems fine. But that doesn't
help for the other arguments. What I'm thinking is that we should just
arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
then XLogAcceptWrites() can fish those values out of there as well,
which should be enough to make it work and do the same thing
regardless of which process is calling it. But I have run out of time
for today so have not explored coding that up.

I have spent some time thinking about making XLogAcceptWrites()
independent, and for that we need to get rid of its arguments, which
are available only in the startup process. The first argument,
xlogaction, is deduced by DetermineRecoveryXlogAction(). If we can make
that function's logic process-independent so that xlogaction can be
deduced in any process, we can skip passing it as an argument.

DetermineRecoveryXlogAction() depends on a few global variables that
are valid only in the startup process: InRecovery,
ArchiveRecoveryRequested, and LocalPromoteIsTriggered. Of the three,
LocalPromoteIsTriggered's value is already available in shared memory
and can be fetched by calling PromoteIsTriggered(). InRecovery's value
can be inferred from DBState in the control file, as long as that state
doesn't change unexpectedly before XLogAcceptWrites() executes: if
DBState does not indicate a clean shutdown, then the server has surely
gone through recovery. If we can rely on DBState in the control file,
we are good to go there. For the last one, ArchiveRecoveryRequested, I
don't see any existing shared memory or control file information from
which we could tell whether archive recovery was requested. Initially,
I thought of using SharedRecoveryState, which is always set to
RECOVERY_STATE_ARCHIVE if archive recovery is requested. But there is
another case where SharedRecoveryState can be RECOVERY_STATE_ARCHIVE
regardless of the ArchiveRecoveryRequested value, namely the presence
of a backup label file. If we want to use SharedRecoveryState, we need
one more state to differentiate between ArchiveRecoveryRequested and
the backup-label-file case. To move ahead, I have copied
ArchiveRecoveryRequested into shared memory, and it is cleared once
archive cleanup is finished. With all these changes, we can get rid of
the xlogaction argument and the DetermineRecoveryXlogAction() function,
and move its logic into PerformRecoveryXLogAction() directly.
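
In code form, the DBState-based inference boils down to something like
this (as in the attached patches):

/*
 * Infer "the server went through recovery" from the control file
 * instead of the startup-only InRecovery global.  Assumes DBState is
 * not changed unexpectedly before XLogAcceptWrites() runs.
 */
if (ControlFile->state != DB_SHUTDOWNED)
	promoted = PerformRecoveryXLogAction();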

Now, the remaining two arguments of XLogAcceptWrites() are required
for the CleanupAfterArchiveRecovery() function. Along with these two
arguments, this function requires ArchiveRecoveryRequested and
ThisTimeLineID, which are again global variables. With the previous
changes, we now have ArchiveRecoveryRequested in shared memory.
And for ThisTimeLineID, I don't think we need to do anything, since this
value is available in all backends as per the following comment:

"
/*
* ThisTimeLineID will be same in all backends --- it identifies current
* WAL timeline for the database system.
*/
TimeLineID ThisTimeLineID = 0;
"

In addition to the four places that Robert pointed out for EndOfLog,
XLogCtl->lastSegSwitchLSN also holds the EndOfLog value, and that
doesn't seem to change until WAL writes are enabled. For EndOfLogTLI, I
think we can safely use XLogCtl->replayEndTLI. Currently, EndOfLogTLI
is the timeline ID of the last record that the xlogreader reads, but
that xlogreader was simply re-fetching the last record we had already
replayed in the redo loop if we were in recovery; if we were not in
recovery, we don't need to worry, since this value is needed only when
ArchiveRecoveryRequested = true, which implicitly forces redo and sets
the replayEndTLI value.
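
For instance, in the attached v34-0001, CleanupAfterArchiveRecovery()
picks these values up as:

XLogRecPtr	EndOfLog;
TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;

/* lastSegSwitchLSN still holds the end-of-recovery LSN at this point */
(void) GetLastSegSwitchData(&EndOfLog);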

With all the above changes XLogAcceptWrites() can be called from other
processes, but I haven't tested that yet. This finding is still not
complete and not too clean; I am posting the patches with the
aforesaid changes just to confirm the direction and move the
discussion forward, thanks.

Regards,
Amul

Attachments:

v34-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patchapplication/x-patch; name=v34-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patchDownload
From 15329d85e26967602e5aedb14e10f31a8631e33c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v34 1/3] Refactor some end-of-recovery code out of
 StartupXLOG().

Moved the code that decides whether to write a checkpoint or an
end-of-recovery record into PerformRecoveryXlogAction().

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.
---
 src/backend/access/transam/xlog.c | 277 +++++++++++++++++-------------
 1 file changed, 159 insertions(+), 118 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749da..cd1d87c14b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -637,6 +637,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+	 * requested. Protected by info_lck.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -880,6 +886,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -925,6 +932,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5223,6 +5231,7 @@ XLOGShmemInit(void)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
@@ -5507,6 +5516,12 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.  A lock is not
+	 * needed since we are the only ones who updating this.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5694,6 +5709,95 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(void)
+{
+	XLogRecPtr EndOfLog;
+
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old
+	 * timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline,
+	 * remove them. They might contain valid WAL, but they might also be
+	 * pre-allocated files containing garbage. In any case, they are not
+	 * part of the new timeline's history so we don't need them.
+	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with
+	 * the last, partial segment on the old timeline? If we don't archive
+	 * it, and the server that created the WAL never archives it either
+	 * (e.g. because it was hit by a meteor), it will never make it to the
+	 * archive. That's OK from our point of view, because the new segment
+	 * that we created with the new TLI contains all the WAL from the old
+	 * timeline up to the switch point. But if you later try to do PITR to
+	 * the "missing" WAL on the old timeline, recovery won't find it in
+	 * the archive. It's physically present in the new file with new TLI,
+	 * but recovery won't look there when it's recovering to the older
+	 * timeline. On the other hand, if we archive the partial segment, and
+	 * the original server on that timeline is still running and archives
+	 * the completed version of the same segment later, it will fail. (We
+	 * used to do that in 9.4 and below, and it caused such problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial
+	 * suffix, and archive it. Archive recovery will never try to read
+	 * .partial segments, so they will normally go unused. But in the odd
+	 * PITR case, the administrator can copy them manually to the pg_wal
+	 * directory (removing the suffix). They can be useful in debugging,
+	 * too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline,
+	 * however, we had already determined that the segment is complete, so
+	 * we can let it be archived normally. (In particular, if it was
+	 * restored from the archive to begin with, it's expected to have a
+	 * .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+		TimeLineID EndOfLogTLI = XLogCtl->replayEndTLI;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7883,127 +7987,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+		promoted = PerformRecoveryXLogAction();
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8207,6 +8197,57 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static bool
+PerformRecoveryXLogAction(void)
+{
+	bool		promoted = false;
+
+	/*
+	 * In promotion, only create a lightweight end-of-recovery record
+	 * instead of a full checkpoint. A checkpoint is requested later,
+	 * after we're fully out of recovery mode and already accepting
+	 * queries.
+	 *
+	 * NB: Check does not rely on the global variables are valid only in the
+	 * startup process only.
+	 */
+	if (((volatile XLogCtlData *) XLogCtl)->SharedArchiveRecoveryRequested &&
+		IsUnderPostmaster && PromoteIsTriggered())
+	{
+		promoted = true;
+
+		/*
+		 * Insert a special WAL record to mark the end of recovery, since
+		 * we aren't doing a checkpoint. That means that the checkpointer
+		 * process may likely be in the middle of a time-smoothed
+		 * restartpoint and could continue to be for minutes after this.
+		 * That sounds strange, but the effect is roughly the same and it
+		 * would be stranger to try to come out of the restartpoint and
+		 * then checkpoint. We request a checkpoint later anyway, just for
+		 * safety.
+		 */
+		CreateEndOfRecoveryRecord();
+	}
+	else
+	{
+		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+						  CHECKPOINT_IMMEDIATE |
+						  CHECKPOINT_WAIT);
+	}
+
+	return promoted;
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

v34-0003-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/x-patch; name=v34-0003-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From c75ab1ae31afc00c2c4e22902fe421d344d64add Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 15 Sep 2021 01:09:42 -0400
Subject: [PATCH v34 3/3] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.
---
 src/backend/access/transam/xlog.c | 84 ++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ffec084ee98..30600db19f4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -932,6 +932,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -6616,7 +6617,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	bool		promoted;
 	struct stat st;
 
 	/*
@@ -8029,38 +8030,14 @@ StartupXLOG(void)
 	XLogReaderFree(xlogreader);
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory, and later whenever wal write
+	 * permitted, write an XLOG_FPW_CHANGE record before resource manager
+	 * writes cleanup WAL records or checkpoint record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
-	 * shut down cleanly, which been through recovery.
-	 */
-	if (ControlFile->state != DB_SHUTDOWNED)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
 
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8104,6 +8081,53 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(void)
+{
+	bool		promoted = false;
+
+	/* Write an XLOG_FPW_CHANGE record */
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, which been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (((volatile XLogCtlData *) XLogCtl)->SharedArchiveRecoveryRequested)
+	{
+		CleanupAfterArchiveRecovery();
+
+		/* Done with archive recovery cleanup, clear the share memory state. */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		XLogCtl->SharedArchiveRecoveryRequested = false;
+		SpinLockRelease(&XLogCtl->info_lck);
+	}
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

v34-0002-Postpone-some-end-of-recovery-operations-relatin.patchapplication/x-patch; name=v34-0002-Postpone-some-end-of-recovery-operations-relatin.patchDownload
From f43812b8984d4d07f2142019451d8ba14fbece58 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 15 Sep 2021 00:39:03 -0400
Subject: [PATCH v34 2/3] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, we issued XLOG_FPW_CHANGE and either
XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY while still
technically in recovery, and also performed post-archive-recovery
cleanup steps at that point. Postpone that stuff until after we clear
InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.
---
 src/backend/access/transam/xlog.c | 39 +++++++++++++++++--------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cd1d87c14b3..ffec084ee98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7977,24 +7977,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	if (InRecovery)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery();
-
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -8046,6 +8028,27 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, which been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery();
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

#160Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#159)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 15, 2021 at 6:49 AM Amul Sul <sulamul@gmail.com> wrote:

Initially, I thought of using
SharedRecoveryState, which is always set to RECOVERY_STATE_ARCHIVE
if archive recovery is requested. But there is another case where
SharedRecoveryState can be RECOVERY_STATE_ARCHIVE regardless of the
ArchiveRecoveryRequested value, namely the presence of a backup label
file.

Right, there's a difference between whether archive recovery has been
*requested* and whether it is actually *happening*.

If we want to use SharedRecoveryState, we need one more state
to differentiate between ArchiveRecoveryRequested and the
backup-label-file case. To move ahead, I have copied
ArchiveRecoveryRequested into shared memory, and it is cleared
once archive cleanup is finished. With all these changes, we can get
rid of the xlogaction argument and the DetermineRecoveryXlogAction()
function, and move its logic into PerformRecoveryXLogAction() directly.

Putting these changes into 0001 seems to make no sense. It seems like
they should be part of 0003, or maybe a new 0004 patch.

And for ThisTimeLineID, I don't think we need to do anything, since this
value is available in all backends as per the following comment:
"
/*
* ThisTimeLineID will be same in all backends --- it identifies current
* WAL timeline for the database system.
*/
TimeLineID ThisTimeLineID = 0;

I'm not sure I find that argument totally convincing. The two
variables aren't assigned at exactly the same places in the code,
nonwithstanding the comment. I'm not saying you're wrong. I'm just
saying I don't believe it just because the comment says so.

--
Robert Haas
EDB: http://www.enterprisedb.com

#161Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#160)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 15, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:

Putting these changes into 0001 seems to make no sense. It seems like
they should be part of 0003, or maybe a new 0004 patch.

After looking at this a little bit more, I think it's really necessary
to separate out all of your changes into separate patches at least for
initial review. It's particularly important to separate code movement
changes from other kinds of changes. 0001 was just moving code before,
and so was 0002, but now both are making other changes, which is not
easy to see from looking at the 'git diff' output. For that reason
it's not so easy to understand exactly what you've changed here and
analyze it.

I poked around a little bit at these patches, looking for
perhaps-interesting global variables upon which the code called from
XLogAcceptWrites() would depend with your patches applied. The most
interesting ones seem to be (1) ThisTimeLineID, which you mentioned
and which may be fine but I'm not totally convinced yet, (2)
LocalXLogInsertAllowed, which is probably not broken but I'm thinking
we may want to redesign that mechanism somehow to make it cleaner, and
(3) CheckpointStats, which is updated in RemoveXlogFile, which is
called from RemoveNonParentXlogFiles which is called from
CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
This last one is actually pretty weird already in the existing code.
It sort of looks like RemoveXlogFile() only expects to be called from
the checkpointer (or a standalone backend) so that it can update
CheckpointStats and have that just work, but actually it's also called
from the startup process when a timeline switch happens. I don't know
whether the fact that the increments to ckpt_segs_recycled get lost in
that case should be considered an intentional behavior that should be
preserved or an inadvertent mistake.
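
To spell out why the increment gets lost there, here is a rough sketch
(not the real declarations, and the helper is made up): CheckpointStats
lives in ordinary process-local memory, so a bump made in the startup
process never reaches the checkpointer's copy, which is the only copy
that gets reported when a checkpoint finishes.

"
/* Sketch only: every backend has its own private copy of this struct. */
typedef struct CheckpointStatsData
{
	int			ckpt_segs_recycled;		/* WAL segments recycled */
	/* ... other counters ... */
} CheckpointStatsData;

CheckpointStatsData CheckpointStats;	/* not in shared memory */

/* Made-up stand-in for the recycle path inside RemoveXlogFile(). */
static void
recycle_wal_segment(void)
{
	/*
	 * Only meaningful when running in the checkpointer; if the startup
	 * process gets here (timeline switch), the increment lands in its own
	 * copy and is never reported anywhere.
	 */
	CheckpointStats.ckpt_segs_recycled++;
}
"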

So I think you've covered most of the necessary things here, with
probably some more discussion needed on whether you've done the right
things...

--
Robert Haas
EDB: http://www.enterprisedb.com

#162Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#161)
5 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 15, 2021 at 9:38 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 15, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:

Putting these changes into 0001 seems to make no sense. It seems like
they should be part of 0003, or maybe a new 0004 patch.

After looking at this a little bit more, I think it's really necessary
to separate out all of your changes into separate patches at least for
initial review. It's particularly important to separate code movement
changes from other kinds of changes. 0001 was just moving code before,
and so was 0002, but now both are making other changes, which is not
easy to see from looking at the 'git diff' output. For that reason
it's not so easy to understand exactly what you've changed here and
analyze it.

Ok, understood. I have separated my changes into the 0001 and 0002
patches, and the refactoring patches now start from 0003.

In the 0001 patch, I have copied ArchiveRecoveryRequested into shared
memory as discussed previously. Copying the ArchiveRecoveryRequested
value into shared memory is not really ideal, and I think we should
somehow reuse an existing variable (perhaps with some modification of
the information it can store, e.g. adding one more enum value for
SharedRecoveryState, or something else); I am still thinking about that.

In addition, I tried to narrow the scope of the ArchiveRecoveryRequested
global variable. It is now a static variable whose scope is limited to
xlog.c, like LocalXLogInsertAllowed, and it can be accessed through the
newly added function ArchiveRecoveryIsRequested() (like
PromoteIsTriggered()). Let me know what you think of the approach.

The 0002 patch is a mixed one where I tried to remove the dependencies
on global variables and on local variables belonging to StartupXLOG().
I am still worried about the InRecovery value that needs to be deduced
afterward inside XLogAcceptWrites(). Currently it relies on the
ControlFile->state != DB_SHUTDOWNED check, but I think that will not
work well for ASRO, where we plan to skip only the XLogAcceptWrites()
work and let StartupXLOG() do the rest as it is, including updating
ControlFile's DBState to DB_IN_PRODUCTION. In that case we might need
some ugly kludge to call PerformRecoveryXLogAction() in the checkpointer
irrespective of DBState, which makes me a bit uncomfortable.

I poked around a little bit at these patches, looking for
perhaps-interesting global variables upon which the code called from
XLogAcceptWrites() would depend with your patches applied. The most
interesting ones seem to be (1) ThisTimeLineID, which you mentioned
and which may be fine but I'm not totally convinced yet, (2)
LocalXLogInsertAllowed, which is probably not broken but I'm thinking
we may want to redesign that mechanism somehow to make it cleaner, and

Thanks for the detailed off-list explanation of this.

For anybody else reading this: the point (not really a concern, just
something worth improving) is that the LocalSetXLogInsertAllowed()
function call is a kind of hack that enables WAL writes irrespective of
RecoveryInProgress() for a short period. E.g. see the following code in
StartupXLOG():

"
LocalSetXLogInsertAllowed();
UpdateFullPageWrites();
LocalXLogInsertAllowed = -1;
....
....
/*
* If any of the critical GUCs have changed, log them before we allow
* backends to write WAL.
*/
LocalSetXLogInsertAllowed();
XLogReportParameters();
"

Instead of explicitly enabling WAL inserts like this, it would be nicer
if they were somehow implicitly allowed for the startup process and/or
the checkpointer when doing the first checkpoint and/or WAL writes after
recovery. The current LocalSetXLogInsertAllowed() mechanism is not
really harming anything and does not necessarily need to change, but it
would be nice if we could come up with a much cleaner, bug-free design.
(I hope I am not missing anything from the discussion.)
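
Just to illustrate the kind of shape I have in mind (a rough idea only,
not part of the attached patches, and the helper name is made up), the
pattern could perhaps be wrapped so that callers cannot forget to reset
the flag:

"
/* Hypothetical helper, sketch only. */
static void
RunWithLocalXLogInsertAllowed(void (*action) (void))
{
	LocalSetXLogInsertAllowed();
	action();
	LocalXLogInsertAllowed = -1;	/* back to "check recovery state" */
}

/* e.g. RunWithLocalXLogInsertAllowed(UpdateFullPageWrites); */
"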

(3) CheckpointStats, which is updated in RemoveXlogFile, which is
called from RemoveNonParentXlogFiles which is called from
CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
This last one is actually pretty weird already in the existing code.
It sort of looks like RemoveXlogFile() only expects to be called from
the checkpointer (or a standalone backend) so that it can update
CheckpointStats and have that just work, but actually it's also called
from the startup process when a timeline switch happens. I don't know
whether the fact that the increments to ckpt_segs_recycled get lost in
that case should be considered an intentional behavior that should be
preserved or an inadvertent mistake.

I could be wrong, but I think that is intentional. It removes
pre-allocated or bogus files of the old timeline, which are not
supposed to be counted in the stats. The comments for
CheckpointStatsData might not make this clear, but the comment at the
RemoveNonParentXlogFiles() call site inside StartupXLOG() hints at the same:

"
/*
* Before we continue on the new timeline, clean up any
* (possibly bogus) future WAL segments on the old
* timeline.
*/
RemoveNonParentXlogFiles(EndRecPtr, ThisTimeLineID);
....
....

* We switched to a new timeline. Clean up segments on the old
* timeline.
*
* If there are any higher-numbered segments on the old timeline,
* remove them. They might contain valid WAL, but they might also be
* pre-allocated files containing garbage. In any case, they are not
* part of the new timeline's history so we don't need them.
*/
RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
"

So I think you've covered most of the necessary things here, with
probably some more discussion needed on whether you've done the right
things...

Thanks, Robert, for your time.

Regards,
Amul Sul

Attachments:

v35-0005-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/x-patch; name=v35-0005-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From 9d876ab6ffe05228457c9c58442d05cefceb1584 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 07:40:44 -0400
Subject: [PATCH v35 5/5] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 77 +++++++++++++++++++------------
 1 file changed, 47 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e8b1cf4ce20..642608013ae 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -938,6 +938,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -6628,7 +6629,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		promoted = false;
+	bool		promoted;
 	struct stat st;
 
 	/*
@@ -8041,38 +8042,14 @@ StartupXLOG(void)
 	XLogReaderFree(xlogreader);
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory and, later, whenever WAL writes
+	 * are permitted, write an XLOG_FPW_CHANGE record before the resource manager
+	 * writes cleanup WAL records or a checkpoint record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
-	 * shut down cleanly, i.e. it has been through recovery.
-	 */
-	if (ControlFile->state != DB_SHUTDOWNED)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryIsRequested())
-		CleanupAfterArchiveRecovery();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
 
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8116,6 +8093,46 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(void)
+{
+	bool		promoted = false;
+
+	/* Write an XLOG_FPW_CHANGE record */
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, i.e. it has been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryIsRequested())
+		CleanupAfterArchiveRecovery();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

v35-0002-miscellaneous-remove-dependency-on-global-and-lo.patchapplication/x-patch; name=v35-0002-miscellaneous-remove-dependency-on-global-and-lo.patchDownload
From 30392aca58c4f9c103047eed104917f63e5d9236 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 06:43:32 -0400
Subject: [PATCH v35 2/5] miscellaneous: remove dependency on global and local
 variable.

Remove the dependency on global variables and on some local variables in
the StartupXLOG() function whose values can be obtained and/or deduced
from the information available in shared memory.

These changes enable us to move some of the code from StartupXLOG() into
a separate function that can be executed by other processes that are
connected to shared memory.
---
 src/backend/access/transam/xlog.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a6ddfb1872..c9d5bf9a72c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7902,7 +7902,11 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, i.e. it has been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7919,7 +7923,7 @@ StartupXLOG(void)
 		 * queries.
 		 */
 		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
+			PromoteIsTriggered())
 		{
 			promoted = true;
 
@@ -7945,6 +7949,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryIsRequested())
 	{
+		XLogRecPtr	EndOfLog;
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7962,6 +7968,7 @@ StartupXLOG(void)
 		 * pre-allocated files containing garbage. In any case, they are not
 		 * part of the new timeline's history so we don't need them.
 		 */
+		(void) GetLastSegSwitchData(&EndOfLog);
 		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
 
 		/*
@@ -7998,6 +8005,7 @@ StartupXLOG(void)
 		{
 			char		origfname[MAXFNAMELEN];
 			XLogSegNo	endLogSegNo;
+			TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
 			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-- 
2.18.0

v35-0004-Postpone-some-end-of-recovery-operations-relatin.patchapplication/x-patch; name=v35-0004-Postpone-some-end-of-recovery-operations-relatin.patchDownload
From 8062a67ac681940bb239b06e9277a765a546905d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 07:40:36 -0400
Subject: [PATCH v35 4/5] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, the code that decides whether to write a checkpoint or an
end-of-recovery record was moved into PerformRecoveryXLogAction(), and
the post-archive-recovery cleanup code into CleanupAfterArchiveRecovery(),
but both functions were still called from the same place. Now postpone
that work until after we clear InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.

Robert Haas, with modifications by Amul Sul
---
 src/backend/access/transam/xlog.c | 42 +++++++++++++++----------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 12f7e080d3e..e8b1cf4ce20 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7989,27 +7989,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
-	 * shut down cleanly, i.e. it has been through recovery.
-	 */
-	if (ControlFile->state != DB_SHUTDOWNED)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryIsRequested())
-		CleanupAfterArchiveRecovery();
-
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -8061,6 +8040,27 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, i.e. it has been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryIsRequested())
+		CleanupAfterArchiveRecovery();
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

v35-0001-Store-ArchiveRecoveryRequested-in-shared-memory-.patchapplication/x-patch; name=v35-0001-Store-ArchiveRecoveryRequested-in-shared-memory-.patchDownload
From 2f5dc2d9ca71d9bc4ba5dc86a69efca7aa4408f9 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 05:52:19 -0400
Subject: [PATCH v35 1/5] Store ArchiveRecoveryRequested in shared memory and
 change its scope.

Storing the ArchiveRecoveryRequested value in shared memory makes it
accessible to other processes as well. As of now no other process cares
about it, but this will help move the code that is executed when
ArchiveRecoveryRequested is set into other processes.

Also, the patch changes the scope of ArchiveRecoveryRequested from a
global to a file-local variable, and its type to integer. It now has
three values: -1 for unknown, 1 when a request has been made, and 0 when
there is no request or the request has been completed.
---
 src/backend/access/transam/timeline.c    |   6 +-
 src/backend/access/transam/xlog.c        | 122 ++++++++++++++++-------
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/include/access/xlog.h                |   1 +
 src/include/access/xlog_internal.h       |   1 -
 5 files changed, 93 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c1756..0d4951b8255 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -93,7 +93,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 		return list_make1(entry);
 	}
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		TLHistoryFileName(histfname, targetTLI);
 		fromArchive =
@@ -229,7 +229,7 @@ existsTimeLineHistory(TimeLineID probeTLI)
 	if (probeTLI == 1)
 		return false;
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		TLHistoryFileName(histfname, probeTLI);
 		RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
@@ -331,7 +331,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	/*
 	 * If a history file exists for the parent, copy it verbatim
 	 */
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		TLHistoryFileName(histfname, parentTLI);
 		RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749da..4a6ddfb1872 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -239,17 +239,23 @@ static bool LocalPromoteIsTriggered = false;
 static int	LocalXLogInsertAllowed = -1;
 
 /*
- * When ArchiveRecoveryRequested is set, archive recovery was requested,
- * ie. signal files were present. When InArchiveRecovery is set, we are
- * currently recovering using offline XLOG archives. These variables are only
- * valid in the startup process.
- *
- * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
- * currently performing crash recovery using only XLOG files in pg_wal, but
- * will switch to using offline XLOG archives as soon as we reach the end of
- * WAL in pg_wal.
-*/
-bool		ArchiveRecoveryRequested = false;
+ * When ArchiveRecoveryRequested is ARCHIVE_RECOVERY_REQUEST_YES, archive
+ * recovery was requested, ie. signal files were present. When InArchiveRecovery
+ * is set, we are currently recovering using offline XLOG archives.
+ *
+ * When ArchiveRecoveryRequested is ARCHIVE_RECOVERY_REQUEST_YES, but
+ * InArchiveRecovery is false, we're currently performing crash recovery using
+ * only XLOG files in pg_wal, but will switch to using offline XLOG archives as
+ * soon as we reach the end of WAL in pg_wal.
+ *
+ * InArchiveRecovery is only valid in the startup process. ArchiveRecoveryRequested
+ * can be accessed through ArchiveRecoveryIsRequested().
+ */
+#define ARCHIVE_RECOVERY_REQUEST_UNKOWN		-1
+#define ARCHIVE_RECOVERY_REQUEST_NO			0
+#define ARCHIVE_RECOVERY_REQUEST_YES		1
+
+static int	ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
 bool		InArchiveRecovery = false;
 
 static bool standby_signal_file_found = false;
@@ -637,6 +643,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+	 * requested. Protected by info_lck.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -4455,7 +4467,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
+			if (!InArchiveRecovery && ArchiveRecoveryIsRequested() &&
 				!fetching_ckpt)
 			{
 				ereport(DEBUG1,
@@ -5223,6 +5235,7 @@ XLOGShmemInit(void)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
@@ -5485,16 +5498,16 @@ readRecoverySignalFile(void)
 	}
 
 	StandbyModeRequested = false;
-	ArchiveRecoveryRequested = false;
+	ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_NO;
 	if (standby_signal_file_found)
 	{
 		StandbyModeRequested = true;
-		ArchiveRecoveryRequested = true;
+		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_YES;
 	}
 	else if (recovery_signal_file_found)
 	{
 		StandbyModeRequested = false;
-		ArchiveRecoveryRequested = true;
+		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_YES;
 	}
 	else
 		return;
@@ -5507,12 +5520,18 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.  A lock is not
+	 * needed since we are the only ones updating this.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = (bool) ArchiveRecoveryRequested;
 }
 
 static void
 validateRecoveryParameters(void)
 {
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return;
 
 	/*
@@ -5750,7 +5769,7 @@ recoveryStopsBefore(XLogReaderState *record)
 	 * Ignore recovery target settings when not in archive recovery (meaning
 	 * we are in crash recovery).
 	 */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return false;
 
 	/* Check if we should stop as soon as reaching consistency */
@@ -5897,7 +5916,7 @@ recoveryStopsAfter(XLogReaderState *record)
 	 * Ignore recovery target settings when not in archive recovery (meaning
 	 * we are in crash recovery).
 	 */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return false;
 
 	info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
@@ -6211,7 +6230,7 @@ recoveryApplyDelay(XLogReaderState *record)
 		return false;
 
 	/* nothing to do if crash recovery is requested */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return false;
 
 	/*
@@ -6455,7 +6474,7 @@ CheckRequiredParameterValues(void)
 	 * For archive recovery, the WAL must be generated with at least 'replica'
 	 * wal_level.
 	 */
-	if (ArchiveRecoveryRequested && ControlFile->wal_level == WAL_LEVEL_MINIMAL)
+	if (ArchiveRecoveryIsRequested() && ControlFile->wal_level == WAL_LEVEL_MINIMAL)
 	{
 		ereport(FATAL,
 				(errmsg("WAL was generated with wal_level=minimal, cannot continue recovering"),
@@ -6467,7 +6486,7 @@ CheckRequiredParameterValues(void)
 	 * For Hot Standby, the WAL must be generated with 'replica' mode, and we
 	 * must have at least as many backend slots as the primary.
 	 */
-	if (ArchiveRecoveryRequested && EnableHotStandby)
+	if (ArchiveRecoveryIsRequested() && EnableHotStandby)
 	{
 		/* We ignore autovacuum_max_workers when we make this test. */
 		RecoveryRequiresIntParameter("max_connections",
@@ -6633,7 +6652,7 @@ StartupXLOG(void)
 	readRecoverySignalFile();
 	validateRecoveryParameters();
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		if (StandbyModeRequested)
 			ereport(LOG,
@@ -6666,7 +6685,7 @@ StartupXLOG(void)
 	 * Take ownership of the wakeup latch if we're going to sleep during
 	 * recovery.
 	 */
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/* Set up XLOG reader facility */
@@ -6833,7 +6852,7 @@ StartupXLOG(void)
 		 * to minRecoveryPoint, up to backupEndPoint, or until we see an
 		 * end-of-backup record), and we can enter archive recovery directly.
 		 */
-		if (ArchiveRecoveryRequested &&
+		if (ArchiveRecoveryIsRequested() &&
 			(ControlFile->minRecoveryPoint != InvalidXLogRecPtr ||
 			 ControlFile->backupEndRequired ||
 			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
@@ -7063,7 +7082,7 @@ StartupXLOG(void)
 	}
 	else if (ControlFile->state != DB_SHUTDOWNED)
 		InRecovery = true;
-	else if (ArchiveRecoveryRequested)
+	else if (ArchiveRecoveryIsRequested())
 	{
 		/* force recovery due to presence of recovery signal file */
 		InRecovery = true;
@@ -7229,7 +7248,7 @@ StartupXLOG(void)
 		 * control file and we've established a recovery snapshot from a
 		 * running-xacts WAL record.
 		 */
-		if (ArchiveRecoveryRequested && EnableHotStandby)
+		if (ArchiveRecoveryIsRequested() && EnableHotStandby)
 		{
 			TransactionId *xids;
 			int			nxids;
@@ -7646,7 +7665,7 @@ StartupXLOG(void)
 		 * This check is intentionally after the above log messages that
 		 * indicate how far recovery went.
 		 */
-		if (ArchiveRecoveryRequested &&
+		if (ArchiveRecoveryIsRequested() &&
 			recoveryTarget != RECOVERY_TARGET_UNSET &&
 			!reachedRecoveryTarget)
 			ereport(FATAL,
@@ -7674,7 +7693,7 @@ StartupXLOG(void)
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 		DisownLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/*
@@ -7725,7 +7744,7 @@ StartupXLOG(void)
 		 * crashes while an online backup is in progress. We must not treat
 		 * that as an error, or the database will refuse to start up.
 		 */
-		if (ArchiveRecoveryRequested || ControlFile->backupEndRequired)
+		if (ArchiveRecoveryIsRequested() || ControlFile->backupEndRequired)
 		{
 			if (ControlFile->backupEndRequired)
 				ereport(FATAL,
@@ -7771,7 +7790,7 @@ StartupXLOG(void)
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
 	PrevTimeLineID = ThisTimeLineID;
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		char	   *reason;
 		char		recoveryPath[MAXPGPATH];
@@ -7899,7 +7918,7 @@ StartupXLOG(void)
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
 			LocalPromoteIsTriggered)
 		{
 			promoted = true;
@@ -7924,7 +7943,7 @@ StartupXLOG(void)
 		}
 	}
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		/*
 		 * And finally, execute the recovery_end_command, if any.
@@ -8003,6 +8022,15 @@ StartupXLOG(void)
 				XLogArchiveNotify(partialfname);
 			}
 		}
+
+		/*
+		 * Done with archive recovery request, clear the shared memory state
+		 * which is no longer needed.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		XLogCtl->SharedArchiveRecoveryRequested = false;
+		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
 	/*
@@ -8263,6 +8291,32 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Is archive recovery requested?
+ *
+ * If ArchiveRecoveryRequested is unknown, then it will be updated by checking
+ * shared memory. Like PromoteIsTriggered(), this works in any process that's
+ * connected to shared memory.
+ */
+bool
+ArchiveRecoveryIsRequested(void)
+{
+	/*
+	 * If not UNKNOWN, the ArchiveRecoveryRequested value is either
+	 * ARCHIVE_RECOVERY_REQUEST_YES => 1 or ARCHIVE_RECOVERY_REQUEST_NO => 0,
+	 * which can be coerced to boolean true or false respectively.
+	 */
+	if (likely(ArchiveRecoveryRequested != ARCHIVE_RECOVERY_REQUEST_UNKOWN))
+		return (bool) ArchiveRecoveryRequested;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ArchiveRecoveryRequested = XLogCtl->SharedArchiveRecoveryRequested ?
+		ARCHIVE_RECOVERY_REQUEST_YES : ARCHIVE_RECOVERY_REQUEST_NO;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return (bool) ArchiveRecoveryRequested;
+}
+
 /*
  * Returns current recovery state from shared memory.
  *
@@ -10174,7 +10228,7 @@ xlog_redo(XLogReaderState *record)
 		 * record, the backup was canceled and the end-of-backup record will
 		 * never arrive.
 		 */
-		if (ArchiveRecoveryRequested &&
+		if (ArchiveRecoveryIsRequested() &&
 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
 			ereport(PANIC,
@@ -12176,7 +12230,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster)
+		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e754b..756d03adb6f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -67,7 +67,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	 * Ignore restore_command when not in archive recovery (meaning we are in
 	 * crash recovery).
 	 */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		goto not_available;
 
 	/* In standby mode, restore_command might not be supplied */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a8ede700de..0a356c98d1f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -289,6 +289,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool ArchiveRecoveryIsRequested(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3b5eceff658..2051953d404 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -319,7 +319,6 @@ extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
  * Exported for the functions in timeline.c and xlogarchive.c.  Only valid
  * in the startup process.
  */
-extern bool ArchiveRecoveryRequested;
 extern bool InArchiveRecovery;
 extern bool StandbyMode;
 extern char *recoveryRestoreCommand;
-- 
2.18.0

v35-0003-Refactor-some-end-of-recovery-code-out-of-Startu.patchapplication/x-patch; name=v35-0003-Refactor-some-end-of-recovery-code-out-of-Startu.patchDownload
From b5e6bd6fd572b23a087adfcc448e180e91f6d4ce Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 07:40:07 -0400
Subject: [PATCH v35 3/5] Refactor some end-of-recovery code out of
 StartupXLOG().

Move the code that decides whether to write a checkpoint or an
end-of-recovery record into PerformRecoveryXLogAction().

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 285 ++++++++++++++++--------------
 1 file changed, 154 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d5bf9a72c..12f7e080d3e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -892,6 +892,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -937,6 +938,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5713,6 +5715,101 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(void)
+{
+	XLogRecPtr	EndOfLog;
+
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be pre-allocated
+	 * files containing garbage. In any case, they are not part of the new
+	 * timeline's history so we don't need them.
+	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because it
+	 * was hit by a meteor), it will never make it to the archive. That's OK
+	 * from our point of view, because the new segment that we created with the
+	 * new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically present
+	 * in the new file with new TLI, but recovery won't look there when it's
+	 * recovering to the older timeline. On the other hand, if we archive the
+	 * partial segment, and the original server on that timeline is still
+	 * running and archives the completed version of the same segment later, it
+	 * will fail. (We used to do that in 9.4 and below, and it caused such
+	 * problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix, and
+	 * archive it. Archive recovery will never try to read .partial segments, so
+	 * they will normally go unused. But in the odd PITR case, the administrator
+	 * can copy them manually to the pg_wal directory (removing the suffix).
+	 * They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let it
+	 * be archived normally. (In particular, if it was restored from the archive
+	 * to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+		TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+
+	/*
+	 * Done with archive recovery request, clear the shared memory state which
+	 * is no longer needed.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedArchiveRecoveryRequested = false;
+	ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7907,139 +8004,11 @@ StartupXLOG(void)
 	 * shut down cleanly, which been through recovery.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
-			PromoteIsTriggered())
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+		promoted = PerformRecoveryXLogAction();
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryIsRequested())
-	{
-		XLogRecPtr	EndOfLog;
-
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		(void) GetLastSegSwitchData(&EndOfLog);
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-			TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-
-		/*
-		 * Done with archive recovery request, clear the shared memory state
-		 * which is no longer needed.
-		 */
-		SpinLockAcquire(&XLogCtl->info_lck);
-		XLogCtl->SharedArchiveRecoveryRequested = false;
-		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
-		SpinLockRelease(&XLogCtl->info_lck);
-	}
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8243,6 +8212,60 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static bool
+PerformRecoveryXLogAction(void)
+{
+	bool		promoted = false;
+
+	/*
+	 * Perform a checkpoint to update all our recovery activity to disk.
+	 *
+	 * Note that we write a shutdown checkpoint rather than an on-line one. This
+	 * is not particularly critical, but since we may be assigning a new TLI,
+	 * using a shutdown checkpoint allows us to have the rule that TLI only
+	 * changes in shutdown checkpoints, which allows some extra error checking
+	 * in xlog_redo.
+	 *
+	 * In promotion, only create a lightweight end-of-recovery record instead of
+	 * a full checkpoint. A checkpoint is requested later, after we're fully out
+	 * of recovery mode and already accepting queries.
+	 */
+	if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
+		PromoteIsTriggered())
+	{
+		promoted = true;
+
+		/*
+		 * Insert a special WAL record to mark the end of recovery, since we
+		 * aren't doing a checkpoint. That means that the checkpointer process
+		 * may likely be in the middle of a time-smoothed restartpoint and could
+		 * continue to be for minutes after this.  That sounds strange, but the
+		 * effect is roughly the same and it would be stranger to try to come
+		 * out of the restartpoint and then checkpoint. We request a checkpoint
+		 * later anyway, just for safety.
+		 */
+		CreateEndOfRecoveryRecord();
+	}
+	else
+	{
+		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+						  CHECKPOINT_IMMEDIATE |
+						  CHECKPOINT_WAIT);
+	}
+
+	return promoted;
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

#163Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#152)
11 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi Mark,

I have tried to address your review comments in the attached version;
please see my inline replies below.

On Fri, Sep 10, 2021 at 8:06 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Thank you for looking at the patch. Please see my reply inline below:

On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:

Here is the rebased version.

v33-0004

This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.

Yes, you are correct: xlog.h is included in more than 150 files. I was
wondering if we could use a forward declaration instead of including
pg_control.h (the same way struct XLogRecData is forward-declared in
xlog.h). The wrinkle is that DBState is an enum, and I don't see us
doing this for enums elsewhere the way we do for structures, but that
seems fine to me.
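
For reference, the struct pattern I mean looks roughly like this (a
sketch with a made-up prototype, not code copied from the tree); the
idea would be to do the equivalent for the DBState enum, since the
header only needs the type name and not the full definition:

"
/*
 * A pointer to an incomplete struct type is enough for a prototype, so the
 * header that defines the struct does not have to be included here.
 */
struct XLogRecData;

extern void some_func_taking_recdata(struct XLogRecData *rdata);	/* made-up */
"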

Earlier, I was unsure before preparing this patch, but since that
makes sense (I assume) and minimizes duplication, we can go ahead and
post it separately with the same change in StartupXLOG(), which I have
skipped here for the reason mentioned in the patch's commit message.

FYI, I have posted this patch separately [1] and dropped it from the current set.

v33-0005
The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted".

Fixed.

The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process".

I am not sure I understood the concern; what wording do you think the
comment should use? The function only helps a process signal the
checkpointer; it doesn't say what the checkpointer is supposed to do.

The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up wal probibit shared state"

Done.

The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib.

Ok, "stop processing" is now used. I think the third comment is fine
as a reference instead of copying the same text again; however, I have
changed it a bit for more clarity, to "For the same reason mentioned
previously for the same function call".

tcop/utility.c needlessly includes "access/walprohibit.h"

wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different?

WAIT_EVENT_WALPROHIBIT_STATE_CHANGE is set in pg_prohibit_wal() while
waiting for the WAL prohibit state change to complete.
WAIT_EVENT_WALPROHIBIT_STATE is set for the checkpointer process when
it sees that the system is in the WAL-prohibited state and stops there.
But I agree it makes sense to have only one, i.e.
WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. The same event can be used for the
checkpointer, since it won't do anything until the WAL prohibit state
changes.

WAIT_EVENT_WALPROHIBIT_STATE has been removed in the attached version.
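
As a rough illustration of how a single event could serve both waiters
(a hypothetical helper, not the code in the attached patches; I am
assuming a condition-variable style wait here):

"
#include "postgres.h"
#include "storage/condition_variable.h"
#include "utils/wait_event.h"

/*
 * Sketch only: both pg_prohibit_wal() and the checkpointer could park like
 * this, reporting the same WAIT_EVENT_WALPROHIBIT_STATE_CHANGE wait event.
 */
static void
wait_for_walprohibit_state(ConditionVariable *cv, bool (*reached) (void))
{
	ConditionVariablePrepareToSleep(cv);
	while (!reached())
		ConditionVariableSleep(cv, WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
	ConditionVariableCancelSleep();
}
"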

xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values:

/* State of work that enables wal writes */
typedef enum XLogAcceptWritesState
{
XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */
XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */
} XLogAcceptWritesState;

This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for.

I had tried to avoid mentioning the function name in the comment, since
the enum name already closely resembles XLogAcceptWrites(), the function
whose execution state we are trying to track, but I have added it now;
that should be much clearer.

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

Rephrased the comments, but I think "have an XID" is still more
appropriate there, because the assert function's name ends with HaveXID.

Apart from this, I have moved CheckWALPermitted() closer to
START_CRIT_SECTION (sketched below), as you pointed out in your other
post, and made a few other changes. Note that the patch numbers have
changed: I have rebased my implementation on top of the refactoring
patches under discussion, which I posted previously [2], and reattached
them here so that the CFbot can continue its testing.
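
To show the shape of that change, here is a simplified, made-up example
rather than an actual hunk from the patches:

"
/* Hypothetical WAL-logged operation, sketch only. */
static void
do_wal_logged_change(void)
{
	/*
	 * Check the WAL-prohibited state immediately before entering the
	 * critical section: here an ERROR can still be raised cleanly, whereas
	 * inside the critical section any ERROR is promoted to PANIC.
	 */
	CheckWALPermitted();

	START_CRIT_SECTION();
	/* ... modify buffers, XLogBeginInsert(), XLogInsert(), etc. ... */
	END_CRIT_SECTION();
}
"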

Note that with the current version of the patch on the latest master
head I am seeing one issue, though only intermittently: the same INSERT
query gets stuck waiting for WALWriteLock in exclusive mode. I am not
sure whether it is due to my changes, but it does not occur without my
patch. I am looking into it; in case anybody wants to know more, I have
attached the backtrace, pg_locks and ps output, see the attached file
ps-bt-pg_lock.out.text.

Regards,
Amul

1] /messages/by-id/CAAJ_b97nd_ghRpyFV9Djf9RLXkoTbOUqnocq11WGq9TisX09Fw@mail.gmail.com
2] /messages/by-id/CAAJ_b96G-oBxDC3C7Y72ER09bsheGHOxBK1HXHVOyHNXjTDmcA@mail.gmail.com

Attachments:

ps-bt-pg_lock.out.texttext/plain; charset=US-ASCII; name=ps-bt-pg_lock.out.textDownload
$ ps -ef | grep postgres
amul      46779      1  0 08:17 ?        00:00:00 /home/amul/work/source/PG/tmp_install/tmp/RM5444/inst/bin/postgres -F -c listen_addresses= -k /tmp/pg_upgrade_check-DpL1Pk
amul      46783  46779  0 08:17 ?        00:00:00 postgres: checkpointer 
amul      46784  46779  0 08:17 ?        00:00:00 postgres: background writer 
amul      46787  46779  0 08:17 ?        00:00:00 postgres: walwriter 
amul      46789  46779  0 08:17 ?        00:00:00 postgres: stats collector 
amul      50246  46779  0 08:17 ?        00:00:00 postgres: amul regression [local] INSERT
amul      54260  46779  0 08:18 ?        00:00:00 postgres: autovacuum worker regression
amul      59474  46779  0 08:18 ?        00:00:00 postgres: autovacuum worker postgres
amul      62444  46779  0 08:19 ?        00:00:00 postgres: autovacuum worker regression

==============================
bt:

(gdb) bt
#0  0x00007ff852901afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007ff852901b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007ff852901c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000079f6e2 in PGSemaphoreLock () at pg_sema.c:327
#4  0x000000000082a794 in LWLockAcquire () at lwlock.c:1318
#5  0x00000000005c2110 in AdvanceXLInsertBuffer (upto=182534144, opportunistic=false) at xlog.c:2203
#6  0x00000000005c2273 in GetXLogBuffer () at xlog.c:1982
#7  0x00000000005c2b44 in CopyXLogRecordToWAL (EndPos=182534176, StartPos=182534024, rdata=0x293ba50, isLogSwitch=false, write_len=124) at xlog.c:1581
#8  XLogInsertRecord () at xlog.c:1146
#9  0x00000000005ccadf in XLogInsert () at xloginsert.c:866
#10 0x0000000000557ae1 in heap_insert (relation=relation@entry=0x7ff853308c08, tup=tup@entry=0x2a648c8, cid=cid@entry=44, options=options@entry=0, bistate=bistate@entry=0x0) at heapam.c:2211
#11 0x0000000000565596 in heapam_tuple_insert (relation=0x7ff853308c08, slot=0x2a46600, cid=44, options=0, bistate=0x0) at heapam_handler.c:252
#12 0x00000000006e62e5 in table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0x2a46600, rel=0x7ff853308c08) at ../../../src/include/access/tableam.h:1374
#13 ExecInsert () at nodeModifyTable.c:934
#14 0x00000000006e6ddc in ExecModifyTable () at nodeModifyTable.c:2561
#15 0x00000000006b7ff3 in ExecProcNode (node=0x2a2fa68) at ../../../src/include/executor/executor.h:257
#16 ExecutePlan (execute_once=<optimized out>, dest=0x2a363f0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x2a2fa68, estate=0x2a2f7e0) at execMain.c:1551
#17 standard_ExecutorRun () at execMain.c:361
#18 0x000000000083e540 in ProcessQuery (plan=<optimized out>, sourceText=0x29060c0 "insert into bmscantest select r, 'f", 'o' <repeats 63 times>, "' FROM generate_series(1,100000) r;", params=0x0, queryEnv=0x0, dest=0x2a363f0, qc=0x7fff6ae822d0) at pquery.c:160
#19 0x000000000083f05a in PortalRunMulti (portal=portal@entry=0x2969d10, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x2a363f0, altdest=altdest@entry=0x2a363f0, qc=qc@entry=0x7fff6ae822d0) at pquery.c:1266
#20 0x000000000083f557 in PortalRun () at pquery.c:786
#21 0x000000000083b4e5 in exec_simple_query () at postgres.c:1214
#22 0x000000000083cab9 in PostgresMain () at postgres.c:4497
#23 0x00000000007b239a in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4560
#24 BackendStartup (port=<optimized out>) at postmaster.c:4288
#25 ServerLoop () at postmaster.c:1801
#26 0x00000000007b336b in PostmasterMain () at postmaster.c:1473

==============================

regression=# select * from pg_locks;
   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted | fastpath | waitstart 
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------+-----------
 relation      |    16387 |    12073 |      |       |            |               |         |       |          | 7/2193             | 65034 | AccessShareLock          | t       | t        | 
 virtualxid    |          |          |      |       | 7/2193     |               |         |       |          | 7/2193             | 65034 | ExclusiveLock            | t       | t        | 
 relation      |    16387 |    19473 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    19462 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    19447 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17564 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    19402 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    19388 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    19375 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    19341 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17561 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17640 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17637 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17632 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17629 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17624 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 relation      |    16387 |    17621 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | t        | 
 virtualxid    |          |          |      |       | 3/4997     |               |         |       |          | 3/4997             | 50246 | ExclusiveLock            | t       | t        | 
 relation      |    16387 |     2696 |      |       |            |               |         |       |          | 6/1653             | 62444 | RowExclusiveLock         | t       | t        | 
 relation      |    16387 |     2619 |      |       |            |               |         |       |          | 6/1653             | 62444 | RowExclusiveLock         | t       | t        | 
 relation      |    16387 |    16984 |      |       |            |               |         |       |          | 6/1653             | 62444 | AccessShareLock          | t       | t        | 
 virtualxid    |          |          |      |       | 6/1653     |               |         |       |          | 6/1653             | 62444 | ExclusiveLock            | t       | t        | 
 virtualxid    |          |          |      |       | 5/2721     |               |         |       |          | 5/2721             | 59474 | ExclusiveLock            | t       | t        | 
 relation      |    16387 |     2704 |      |       |            |               |         |       |          | 4/1627             | 54260 | RowExclusiveLock         | t       | t        | 
 relation      |    16387 |     2703 |      |       |            |               |         |       |          | 4/1627             | 54260 | RowExclusiveLock         | t       | t        | 
 virtualxid    |          |          |      |       | 4/1627     |               |         |       |          | 4/1627             | 54260 | ExclusiveLock            | t       | t        | 
 relation      |    16387 |    16709 |      |       |            |               |         |       |          | 6/1653             | 62444 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |        0 |     2677 |      |       |            |               |         |       |          | 5/2721             | 59474 | RowExclusiveLock         | t       | f        | 
 object        |    16387 |          |      |       |            |               |    1247 | 28328 |        0 | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 object        |    16387 |          |      |       |            |               |    1247 | 28327 |        0 | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 relation      |    16387 |    27185 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    27194 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    17629 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    17564 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 transactionid |          |          |      |       |            |          9955 |         |       |          | 6/1653             | 62444 | ExclusiveLock            | t       | f        | 
 object        |    16387 |          |      |       |            |               |    1247 | 28322 |        0 | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 relation      |    16387 |    17561 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    17627 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    17637 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    28333 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 relation      |    16387 |    28332 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareLock                | t       | f        | 
 relation      |    16387 |    17635 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    17643 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    17632 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    28323 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | f        | 
 relation      |    16387 |    28323 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 object        |    16387 |          |      |       |            |               |    1247 | 28321 |        0 | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 object        |    16387 |          |      |       |            |               |    1247 | 28325 |        0 | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 relation      |    16387 |    17624 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    28326 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | f        | 
 relation      |    16387 |    28326 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 relation      |    16387 |     1247 |      |       |            |               |         |       |          | 4/1627             | 54260 | ShareUpdateExclusiveLock | t       | f        | 
 transactionid |          |          |      |       |            |          9954 |         |       |          | 3/4997             | 50246 | ExclusiveLock            | t       | f        | 
 relation      |        0 |     2676 |      |       |            |               |         |       |          | 5/2721             | 59474 | RowExclusiveLock         | t       | f        | 
 relation      |    16387 |    17621 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    28329 |      |       |            |               |         |       |          | 3/4997             | 50246 | RowExclusiveLock         | t       | f        | 
 relation      |    16387 |    28329 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 object        |    16387 |          |      |       |            |               |    2615 |  2200 |        0 | 3/4997             | 50246 | AccessShareLock          | t       | f        | 
 relation      |    16387 |    27190 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |        0 |     1260 |      |       |            |               |         |       |          | 5/2721             | 59474 | ShareUpdateExclusiveLock | t       | f        | 
 object        |    16387 |          |      |       |            |               |    1247 | 28324 |        0 | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
 relation      |    16387 |    17640 |      |       |            |               |         |       |          | 3/4997             | 50246 | ShareUpdateExclusiveLock | t       | f        | 
 relation      |    16387 |    28320 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessShareLock          | t       | f        | 
 relation      |    16387 |    28320 |      |       |            |               |         |       |          | 3/4997             | 50246 | AccessExclusiveLock      | t       | f        | 
(64 rows)

v35-0010-Test-Few-tap-tests-for-wal-prohibited-system.patch (application/x-patch)
From 114a72d2bc8fe01b08a03b0298b232d555185a3c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v35 10/10] Test: Few tap tests for wal prohibited system

Does the following testing:

1. Basic verification, such as inserts into a normal and an unlogged
   table on a wal prohibited system.
2. Check that a non-superuser needs permission to alter the wal
   prohibited system state.
3. Verify that a session with an open write transaction is disconnected
   when the system state is changed to wal prohibited.
4. Verify that the wal write and checkpoint lsn do not change across a
   restart of a wal prohibited system, and neither does the wal
   prohibited state.
5. At restart of a wal prohibited system the shutdown and recovery-end
   checkpoints are skipped; verify that an implicit checkpoint is
   performed when the system state changes to wal permitted.
6. A standby server cannot be wal prohibited; standby.signal and/or
   recovery.signal take the system out of the wal prohibited state.
7. A session whose transaction has performed a write but not yet
   committed is terminated when the state is changed to WAL prohibited.
---
 src/test/recovery/t/026_pg_prohibit_wal.pl | 213 +++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 src/test/recovery/t/026_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/026_pg_prohibit_wal.pl b/src/test/recovery/t/026_pg_prohibit_wal.pl
new file mode 100644
index 00000000000..10945025217
--- /dev/null
+++ b/src/test/recovery/t/026_pg_prohibit_wal.pl
@@ -0,0 +1,213 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test wal prohibited state.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Config;
+use Test::More tests => 22;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize database node
+my $node_primary = PostgresNode->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Create a few tables and insert some data
+$node_primary->safe_psql('postgres',  <<EOSQL);
+CREATE TABLE tab AS SELECT i FROM generate_series(1,5) i;
+CREATE UNLOGGED TABLE unlogtab AS SELECT i FROM generate_series(1,5) i;
+EOSQL
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is now wal prohibited');
+
+#
+# In the wal prohibited state, further table inserts will fail.
+#
+# Note that even though an insert into an unlogged or temporary table doesn't
+# generate wal, the inserting transaction will acquire a transaction id, which
+# is not allowed on a wal prohibited system. Also, that transaction's abort or
+# commit record will be wal logged at the end, which is prohibited as well.
+#
+my ($stdout, $stderr, $timed_out);
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, table insert failed');
+$node_primary->psql('postgres', 'INSERT INTO unlogtab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, unlogged table insert failed');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on server, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint, last checkpoint lsn : $checkpoint_lsn");
+
+# Change server to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'server is changed to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+is($new_checkpoint_lsn ne $checkpoint_lsn, 1,
+	"new checkpoint performed, new checkpoint lsn : $new_checkpoint_lsn");
+
+my $new_write_lsn = $node_primary->lsn('write');
+is($new_write_lsn ne $write_lsn, 1,
+	"new wal writes on server, new latest wal write lsn : $new_write_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '6',
+	'table insert passed');
+
+# Only the superuser and users who have been granted permission are able to
+# call pg_prohibit_wal() to change the wal prohibited state.
+$node_primary->safe_psql('postgres', 'CREATE USER non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+like($stderr, qr/permission denied for function pg_prohibit_wal/,
+	'permission denied to non-superuser to alter wal prohibited state');
+$node_primary->safe_psql('postgres', 'GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'granted permission to non-superuser, able to alter wal prohibited state');
+
+# back to normal state
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(false)');
+
+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$mysession_stdin,
+	'>',
+	\$mysession_stdout,
+	'2>',
+	\$mysession_stderr,
+	$psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+	'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+	'session with open write transaction is terminated');
+
+# Now stop the primary server in WAL prohibited state and take filesystem level
+# backup and set up new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgresNode->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server was stopped in the wal prohibited state, so the filesystem
+# level copy will also be in the wal prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set the new server as standby of primary.
+# enable_streaming will create a standby.signal file, which will take the
+# system out of the wal prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check that the new server has been taken out of the wal prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# Recovery server cannot be put into wal prohibited state.
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# The primary is still in the wal prohibited state, so a further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'primary server is wal prohibited, table insert failed');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'primary server is changed to wal permitted');
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '7',
+	'insert passed on primary');
+
+# Wait for standbys to catch up
+$node_primary->wait_for_catchup($node_standby, 'write');
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '7',
+	'new insert replicated on standby as well');
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location in control file found\n"
+	unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0

v35-0008-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From e2127666277a8dae7bafb0ead5a107ddc7bf3b5c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v35 08/10] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR, based on the following criteria, for code paths that
write WAL while the system is WAL prohibited:

 - Add an ERROR for functions that can be reached without a valid XID, e.g. in
   the case of VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - Add an Assert for functions that cannot be reached without a valid XID; the
   Assert ensures XID validation.  For that, AssertWALPermitted_HaveXID() is
   added.

To enforce the rule that the aforesaid assert or error check happens before
entering a critical section for a WAL write, a new assert-only flag
walpermit_checked_state is added.  If the check is missing, XLogBeginInsert()
will fail an assertion when it is called inside a critical section.

If we are not doing the WAL insert inside a critical section, the above check
is not necessary; we can rely on XLogBeginInsert() for that check and report an
error.
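
To make the rule concrete, the following minimal sketch shows the shape a
WAL-writing code path takes after this change. The function name
example_log_page is made up for illustration only; the body simply mirrors the
heap_surgery.c hunk below and assumes the usual backend headers are already
included.

/*
 * Hypothetical example, not part of the patch: check WAL permission before
 * entering a critical section that will write WAL.
 */
static void
example_log_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/*
	 * This path is assumed reachable without a valid XID (as in VACUUM), so
	 * use CheckWALPermitted() to raise an ERROR while that is still safe.
	 * Paths that are guaranteed to hold a valid XID would instead use
	 * AssertWALPermittedHaveXID(), because the permission check has already
	 * happened at XID assignment.
	 */
	if (needwal)
		CheckWALPermitted();

	/* No ereport(ERROR) from here until the changes are logged. */
	START_CRIT_SECTION();

	/* ... scribble on the page ... */
	MarkBufferDirty(buf);

	/* XLogBeginInsert(), called inside, asserts the check above was done. */
	if (needwal)
		log_newpage_buffer(buf, true);

	END_CRIT_SECTION();
}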
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++-
 src/backend/access/brin/brin.c            |  4 ++
 src/backend/access/brin/brin_pageops.c    | 21 ++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 ++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 +++++++--
 src/backend/access/gin/ginfast.c          | 11 +++++-
 src/backend/access/gin/gininsert.c        |  4 ++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 +++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 ++++++-
 src/backend/access/hash/hash.c            | 19 +++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++--
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 ++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 +++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++--
 src/backend/access/heap/visibilitymap.c   | 22 ++++++++++-
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 10 ++++-
 src/backend/access/nbtree/nbtpage.c       | 34 ++++++++++++++--
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 +++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 ++
 src/backend/access/transam/walprohibit.c  | 10 +++++
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 26 +++++++++----
 src/backend/access/transam/xloginsert.c   | 21 ++++++++--
 src/backend/commands/sequence.c           | 16 ++++++++
 src/backend/postmaster/checkpointer.c     | 11 ++++++
 src/backend/storage/buffer/bufmgr.c       | 10 +++--
 src/backend/storage/freespace/freespace.c | 10 ++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 47 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 +++++++++++++
 39 files changed, 514 insertions(+), 68 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..a3718246588 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index fbccf3d038d..e252b2c22a8 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
 			computeLeafRecompressWALData(leaf);
+			CheckWALPermitted();
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..76630b12490 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..5c7b5fc9e9d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6d2d71be32b..7b321c69880 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..e57e83c8c4d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index eb3810494f2..a47a3dd84cc 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index fe9f0df20b1..4ea7b1c934f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b312af57e11..197d226f2ec 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+		CheckWALPermitted();
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
 						XLogEnsureRecordSpace(0, 3 + nitups);
+						CheckWALPermitted();
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 159646c7c3e..d1989e93b35 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 972fdbcb92f..37ab08cad93 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 15ca1b304a0..0cb9adf8b5d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write if WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9eaf07649e8..c6809f5a3e5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1338,6 +1339,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1353,8 +1359,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1958,8 +1963,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1984,7 +1994,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2417,6 +2427,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2427,6 +2438,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2457,7 +2471,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..6cdca3b5918 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -272,12 +274,22 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
+		/*
+		 * We can reach here from VACUUM or from the startup process, so we
+		 * need not have an XID.
+		 *
+		 * Recovery in the startup process is never in the WAL prohibit state,
+		 * so skip the WAL permission check when we reach here in the startup
+		 * process.
+		 */
+		if (needwal)
+			InRecovery ? AssertWALPermitted() : CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +486,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +500,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +534,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 6401fce57b9..17fa7737725 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -235,6 +236,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7355e1dba13..99376db967f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1249,6 +1250,8 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		AssertWALPermittedHaveXID();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1899,13 +1902,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -2467,6 +2473,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ebec8fa5b89..3ed7bb71e69 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c3..1519e4d233d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1107,6 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2212,6 +2215,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2310,6 +2316,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 49404f45a16..50df565386f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for the WAL writes.  For this, either of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index eadb0f36c8c..a5aeee38901 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1317,6 +1318,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1677,6 +1680,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc13c674318..80bcde11231 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1039,7 +1039,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*----------
 	 *
@@ -2890,9 +2890,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise, a dirty
+	 * buffer could never be evicted, since a buffer cannot be written out
+	 * until WAL has been flushed up to its LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9422,6 +9424,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9587,6 +9592,8 @@ CreateEndOfRecoveryRecord(void)
 
 	LocalSetXLogInsertAllowed();
 
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10253,7 +10260,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10267,10 +10274,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10292,8 +10299,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* WAL permission is assured to have been checked above */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b492c656d7a..38818cdba5a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -138,9 +139,20 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
-	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error here would escalate into a
+	 * PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
+	/*
+	 * Cross-check on whether we should be here or not.
+	 *
+	 * This check matters primarily for callers that are not inside a critical
+	 * section and therefore were not required to perform the WAL write
+	 * permission check before reaching here.
+	 */
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -222,6 +234,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a77d592ce5d..11f1fe67c99 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -928,6 +928,17 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/*
+	 * Only the end-of-recovery checkpoint is allowed in the WAL prohibited
+	 * state.
+	 *
+	 * That case is possible only when the system is changing to the WAL
+	 * permitted state and, before completing that change, needs to finish the
+	 * pending operations required to accept WAL writes, which might have been
+	 * skipped earlier when the system was started in the WAL prohibited
+	 * state.
+	 */
+	if (!RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend or checkpointer, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b0..f8afa4cc26e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3888,13 +3888,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or because WAL writes are disallowed in general,
+			 * don't dirty the page.  We can set the hint, but must not dirty
+			 * the page as a result, lest we trigger WAL generation.  Unless
+			 * the page is dirtied again later, the hint will be lost when the
+			 * page is evicted, or at shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ff77a68552c..f347b7ed40d 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -56,4 +57,50 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* This code is never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	/*
+	 * Recovery in the startup process is never in wal prohibited state.
+	 */
+	Assert(InRecovery || XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while the system state is being
+ * changed to WAL prohibited.  Therefore, we need to explicitly error out
+ * before entering the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a30160657..b438ec31fc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset walprohibit_checked_state when we are no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
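
To summarize the pattern the hunks above keep repeating, here is a minimal
sketch of the new coding rule (not part of the patch; the function name, the
rmid/info parameters, and the single-buffer record are purely illustrative):

#include "postgres.h"

#include "access/walprohibit.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Sketch only: buf must already be pinned and exclusively locked. */
static void
sketch_wal_logged_change(Relation rel, Buffer buf, RmgrId rmid, uint8 info)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* New rule: check WAL permission *before* the critical section ... */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... scribble on the page here ... */
	MarkBufferDirty(buf);

	/* ... so that XLogBeginInsert() cannot ERROR (and thus PANIC) in here. */
	if (needwal)
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
		recptr = XLogInsert(rmid, info);
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();
}

For code paths that can only be reached with an XID assigned (heap_insert()
and friends), the patch uses AssertWALPermittedHaveXID() instead, since
pg_prohibit_wal() has already killed such sessions.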

Attachment: v35-0006-Allow-RequestCheckpoint-call-from-checkpointer-p.patch (application/x-patch)
From d3aa611a89c58de4a5a8139ccd7cb7890c0dfc7f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 21 Sep 2021 06:05:36 -0400
Subject: [PATCH v35 06/10] Allow RequestCheckpoint() call from checkpointer
 process

---
 src/backend/postmaster/checkpointer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..9af7dec9212 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,9 +924,9 @@ RequestCheckpoint(int flags)
 				old_started;
 
 	/*
-	 * If in a standalone backend, just do it ourselves.
+	 * If in a standalone backend or checkpointer, just do it ourselves.
 	 */
-	if (!IsPostmasterEnvironment)
+	if (!IsPostmasterEnvironment || AmCheckpointerProcess())
 	{
 		/*
 		 * There's no point in doing slow checkpoints in a standalone backend,
-- 
2.18.0

Attachment: v35-0007-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 51fba084a6978fca5ca9ccf8448274e6f254e32c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v35 07/10] Implement wal prohibit state using global barrier.

Implementation:

 1. A user changes the server state to WAL-prohibited by calling the
    pg_prohibit_wal(true) SQL function.  The current state is marked as an
    in-progress transition in shared memory and the checkpointer process is
    signaled.  The checkpointer, on noticing the state transition, emits the
    barrier request and then, once the transition has been completed,
    acknowledges back to the backend that requested the state change.  The
    final state is updated in the control file to make it persistent across
    system restarts.  (A small sketch of the state counter encoding follows
    this list.)

 2. When a backend receives the WAL-prohibited barrier, if it is already in a
    transaction and that transaction has already been assigned an XID, the
    backend is killed by throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special needs to happen right now; it simply calls
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which reflects the WAL prohibited state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    in the WAL-prohibited state until someone wakes them up, e.g. a backend
    might later request that the system be put back into the state where WAL
    is no longer prohibited.

 6. At shutdown in WAL-prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery, but
    the end-of-recovery checkpoint and the WAL writes necessary to start the
    server normally are skipped; they are performed once the system is
    changed back to the state where WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is not allowed on a standby server.

 8. The presence of a recovery.signal and/or standby.signal file implicitly
    pulls the server out of the WAL prohibited state permanently.

 9. Add a wal_prohibited GUC to show the system state; it will be "on" when
    the system is WAL prohibited.
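
To make the counter bookkeeping in point 1 concrete, here is a tiny standalone
sketch (not taken from the patch: the state names and the ordering below are
only what the transitions described above imply):

#include <stdio.h>

int
main(void)
{
	/* The low two bits of the shared counter encode the current state. */
	static const char *const states[] = {
		"READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
	};
	unsigned int counter = 0;	/* system starts read-write */
	int			i;

	/*
	 * pg_prohibit_wal() bumps the counter once to announce a transition and
	 * then waits; the checkpointer bumps it a second time when the transition
	 * completes (after the barrier when prohibiting WAL, before it when
	 * re-allowing writes).  The requester is done once the counter reaches
	 * its target value.
	 */
	for (i = 0; i < 4; i++)
	{
		counter++;
		printf("counter=%u state=%s\n", counter, states[counter & 3]);
	}
	return 0;
}
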
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 480 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 180 ++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   8 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  21 +
 src/backend/storage/ipc/ipci.c           |   7 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  32 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  59 +++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   3 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 857 insertions(+), 73 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..49404f45a16
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,480 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take an appropriate action when system wide WAL prohibit
+ *	state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: We kill off the whole session by throwing FATAL instead of
+		 * killing just the transaction by throwing ERROR, for the following
+		 * reasons, which need more thought:
+		 *
+		 * 1. Due to challenges with the wire protocol, we cannot simply kill
+		 * off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ *
+ *	NB: The function always returns true; that leaves scope for future code
+ *	changes that might need to return false for some reason.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after some time.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_BOOL(true);
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the new WAL
+	 * prohibit state to all backends.  The checkpointer will do that and will
+	 * update the shared-memory WAL prohibit state counter and the control
+	 * file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_BOOL(true);		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Other than read-write state will be considered as read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete the WAL prohibit state transition.
+ *
+ *	Depending on the final state being transitioned to, the in-memory state is
+ *	updated either before or after emitting the global barrier.
+ *
+ *	The idea is that when we say the system is WAL prohibited, WAL writes must
+ *	be prohibited in all backends, but when the system is no longer WAL
+ *	prohibited, it is not necessary to take every backend out of the WAL
+ *	prohibited state immediately.  There is no harm in letting those backends
+ *	run as read-only a little longer until we emit the barrier, since they
+ *	might have connected while the system was WAL prohibited and may be doing
+ *	read-only work.  Backends that connect from now on can immediately start
+ *	read-write operations.
+ *
+ *	Therefore, when moving the system to the state where WAL is no longer
+ *	prohibited, we update the shared state immediately and emit the barrier
+ *	later.  But when moving the system to the WAL prohibited state, we emit
+ *	the global barrier first, to ensure that no backend writes WAL before we
+ *	set the system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by the checkpointer.  Otherwise, this must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final WAL prohibit
+	 * state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* Going out of WAL prohibited state then update state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow wal writes should be done by now  */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * We don't need to be too aggressive about flushing XLOG data right away,
+	 * since XLogFlush is not restricted in the WAL prohibited state anyway.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by all backends, which ensures that every
+	 * backend is in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have changed in the meantime.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Must be called by the checkpointer process.  The checkpointer has to
+	 * make sure it processes all pending WAL prohibit state change requests
+	 * as soon as possible.  Since CreateCheckPoint and ProcessSyncRequests
+	 * sometimes run in non-checkpointer processes, do nothing if not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes the startup process needs in order to start the
+				 * server normally were skipped; if so, perform them right
+				 * away.  While doing that, hold off the state transition to
+				 * avoid a recursive call to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				 * checkpoint.
+				 */
+				if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+				{
+					HoldWALProhibitStateTransition = true;
+					PerformPendingXLogAcceptWrites();
+					HoldWALProhibitStateTransition = false;
+				}
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request that the system be put back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6597ec45a95..eadb0f36c8c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1962,23 +1962,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because the pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 15dd322d21d..dc13c674318 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -232,9 +233,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the system is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -735,6 +737,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState tracks whether the end-of-recovery checkpoint
+	 * and the WAL writes required to start the server normally have been done.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -4966,6 +4974,17 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5243,6 +5262,7 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6414,6 +6434,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch latest state of allow WAL writes.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6779,13 +6808,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryIsRequested())
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a WAL prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8048,8 +8094,29 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 
-	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites();
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory; it determines whether further WAL inserts are allowed
+	 * or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * a WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		promoted = XLogAcceptWrites();
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8101,6 +8168,20 @@ XLogAcceptWrites(void)
 {
 	bool		promoted = false;
 
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in a WAL prohibited state, only the checkpointer should
+	 * reach here, to complete the operation that was skipped earlier when the
+	 * system was booted in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
+
 	/* Write an XLOG_FPW_CHANGE record */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
@@ -8109,8 +8190,10 @@ XLogAcceptWrites(void)
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
 	 * shut down cleanly, which been through recovery.
+	 *
+	 * TODO: XXX: need good fix for checkpointer.
 	 */
-	if (ControlFile->state != DB_SHUTDOWNED)
+	if (ControlFile->state != DB_SHUTDOWNED || AmCheckpointerProcess())
 		promoted = PerformRecoveryXLogAction();
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
@@ -8130,9 +8213,41 @@ XLogAcceptWrites(void)
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	ResetLocalXLogInsertAllowed();
+
+	/* Prepare to accept WAL writes. */
+	(void) XLogAcceptWrites();
+
+	/*
+	 * We need to update DBState explicitly, as the startup process does,
+	 * because the end-of-recovery checkpoint sets the database state to
+	 * DB_SHUTDOWNED.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8427,9 +8542,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8448,9 +8563,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8472,6 +8598,12 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8761,9 +8893,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint if we're still in recovery; otherwise, perform
+	 * the shutdown checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8776,6 +8912,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -9025,8 +9164,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index a416e94d371..0934478188e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -699,6 +699,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
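
Since EXECUTE on pg_prohibit_wal() is revoked from PUBLIC above, an
administrator would grant it explicitly to whatever role drives the HA
tooling.  A minimal sketch (the role name is illustrative):

    GRANT EXECUTE ON FUNCTION pg_prohibit_wal(boolean) TO ha_controller;
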
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 3b3df8fa8cc..1b6787a52c3 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,12 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, WAL writes must be permitted.  Second, we
+		 * need to make sure that there is a worker slot available.  Third, we
+		 * need to make sure that no other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9af7dec9212..a77d592ce5d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -348,6 +349,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -692,6 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1341,3 +1346,19 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows any process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9fa3e0631e6..9b391cb9cc2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -247,6 +248,12 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up the shared memory structure needed to handle concurrent WAL
+	 * prohibit state change requests.
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 364654e1060..c5d8edd82bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 4a2ed414b00..17eb1e58e3d 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -236,10 +237,17 @@ SyncPostCheckpoint(void)
 		pfree(entry);
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
-		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop
-		 * (note it might try to delete list entries).
+		 * As in ProcessSyncRequests, we don't want to stop processing WAL
+		 * prohibit state change requests for a long time when there are many
+		 * deletions to be done; the checkpointer needs to check for and
+		 * process them as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -278,6 +286,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -336,6 +347,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to stop processing WAL prohibit state change requests
+		 * for a long time when there are many fsync requests to be processed;
+		 * the checkpointer needs to handle them as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -422,6 +440,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check again for a WAL prohibit state change request, for
+				 * the same reason given above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index bf085aa93b2..4f4d07ec558 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ef7e6bfb779..e9b9ee0a0fd 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
 			event_name = "LogicalSubxactWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d2ce4a84506..0cda0e690af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -234,6 +235,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -676,6 +678,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2118,6 +2121,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether WAL writes are prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12559,4 +12574,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string must match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ff77a68552c
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,59 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
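
As an aside, the counter encoding described in the header comment above can be
illustrated with a small standalone sketch (not part of the patch; it uses a
plain local counter in place of the shared atomic one):

    #include <stdio.h>
    #include <stdint.h>

    /* Mirrors the WALProhibitState encoding from walprohibit.h */
    typedef enum
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
        WALPROHIBIT_STATE_READ_ONLY = 2,
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3
    } WALProhibitState;

    int
    main(void)
    {
        uint32_t    counter = 0;    /* 0 = started read-write, 2 = started read-only */

        counter++;      /* ALTER SYSTEM READ ONLY requested: GOING_READ_ONLY */
        counter++;      /* transition complete: READ_ONLY */
        counter++;      /* ALTER SYSTEM READ WRITE requested: GOING_READ_WRITE */
        counter++;      /* transition complete: low bits wrap back to READ_WRITE */

        /* Only the low two bits carry the state, as in GetWALProhibitState() */
        printf("state = %u\n", (unsigned) (counter & 3));   /* prints "state = 0" */
        return 0;
    }
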
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8ea4e583980..02c6a022d77 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -133,6 +133,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of XLogAcceptWrites() execution */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped XLogAcceptWrites() */
+	XLOG_ACCEPT_WRITES_DONE			/* done with XLogAcceptWrites() */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -281,6 +289,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -290,8 +299,10 @@ extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
 extern bool ArchiveRecoveryIsRequested(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern uint64 GetSystemIdentifier(void);
 extern char *GetMockAuthenticationNonce(void);
 extern bool DataChecksumsEnabled(void);
@@ -302,6 +313,7 @@ extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce7..f6a1f3b9826 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are currently prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532ec..78da6229168 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11645,6 +11645,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'change server to permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'bool',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6007827b445..3308feacfd3 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -225,7 +225,8 @@ typedef enum
 	WAIT_EVENT_LOGICAL_CHANGES_READ,
 	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
 	WAIT_EVENT_LOGICAL_SUBXACT_READ,
-	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 402a6617a98..5d48f3238c9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2827,6 +2827,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

Attachment: v35-0009-Documentation.patch (application/x-patch)
From ec2d42a434509f21de3b6b4c949d000c824971e8 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v35 09/10] Documentation.

---
 doc/src/sgml/func.sgml              | 20 ++++++++++
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 doc/src/sgml/monitoring.sgml        |  4 ++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 5 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbeb..ccd71c25a08 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25343,6 +25343,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state of the server and forces all
+        <productname>PostgreSQL</productname> server processes to accept that
+        state change immediately.  When <literal>true</literal> is passed, the
+        system is put into the WAL prohibited state, in which WAL writes are
+        not allowed, if it is not in that state already.  When
+        <literal>false</literal> is passed, the system is put into the WAL
+        permitted state, in which WAL writes are allowed, if it is not in that
+        state already.  See <xref linkend="wal-prohibited-state"/> for details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
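
For reference, the intended usage from psql looks roughly like this (a sketch;
it assumes a role that has been granted EXECUTE on the function, since the
patch revokes it from PUBLIC):

    -- Put the system into the WAL prohibited (read-only) state
    SELECT pg_prohibit_wal(true);

    -- The read-only GUC added by the patch reflects the current state
    SHOW wal_prohibited;        -- reports "on"

    -- Return to normal read-write operation
    SELECT pg_prohibit_wal(false);
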
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c43f2140205..98b660941b1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any user with the required
+    privileges can call the <function>pg_prohibit_wal</function> function to
+    force the system into WAL prohibited mode, in which write-ahead log
+    inserts are prohibited until the same function is executed to return the
+    system to the read-write state.  As in Hot Standby, connections to the
+    server may still run read-only queries while in the WAL prohibited state.
+    While the system is in the WAL prohibited state, the
+    <literal>wal_prohibited</literal> GUC reports <literal>on</literal>;
+    otherwise it reports <literal>off</literal>.  When the WAL prohibited
+    state is requested, any session that is running a transaction which has
+    already performed WAL writes, or may still need to perform them, is
+    terminated.  This is useful in an HA setup where the master server needs
+    to stop accepting WAL writes immediately and abort any transaction that
+    would need to write WAL at commit, for example when the network or the
+    replication connections on the master have gone down.
+   </para>
+
+   <para>
+    Shutting down a system that is in the WAL prohibited state skips the
+    shutdown checkpoint.  At the next start, the server therefore goes through
+    crash recovery and stays in that state (its end-of-recovery checkpoint is
+    deferred) until the system is changed back to read-write.  If a server
+    starting in the WAL prohibited state finds <filename>standby.signal</filename>
+    or <filename>recovery.signal</filename>, it implicitly leaves that state.
+   </para>
+ </sect1>
 </chapter>
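
As a rough illustration of the HA use case described in this section, the
sequence below stops WAL writes on the primary and promotes the standby once
it has caught up.  This is only a sketch: real HA tooling would automate the
wait, and pg_current_wal_lsn(), pg_last_wal_replay_lsn(), and pg_promote() are
existing core functions, not part of this patch.

    -- On the primary: stop accepting WAL writes and note the final position
    SELECT pg_prohibit_wal(true);
    SELECT pg_current_wal_lsn();

    -- On the standby: wait until replay reaches that position, then promote
    SELECT pg_last_wal_replay_lsn();
    SELECT pg_promote();
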
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2281ba120f3..eb68a724944 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1565,6 +1565,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalSubxactWrite</literal></entry>
       <entry>Waiting for a write to a logical subxact file.</entry>
      </row>
+     <row>
+      <entry><literal>SystemWALProhibitStateChange</literal></entry>
+      <entry>Waiting for a WAL prohibit state change.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+The system is in a WAL prohibited state when it is not currently possible to
+insert write-ahead log records, either because the system is still in recovery
+or because it was forced into that state by executing the pg_prohibit_wal()
+function.  We have a lower-level defense in XLogBeginInsert() and elsewhere
+that stops us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error, since any error raised there escalates to a PANIC.
+
+During recovery we never reach the point of trying to write WAL, but
+pg_prohibit_wal() can be executed at any time to stop WAL writing.  Any backend
+that receives the WAL prohibit state transition barrier interrupt must stop
+writing WAL immediately.  While absorbing the barrier, a backend kills its
+running transaction if that transaction has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning to perform, WAL
+writes.  Transactions that have not yet acquired an XID, and operations such as
+VACUUM or CREATE INDEX CONCURRENTLY that do not necessarily have an XID when
+they write WAL, are not stopped during barrier processing; they may instead hit
+an error from XLogBeginInsert() when they try to write WAL in the WAL
+prohibited state.  To prevent such an error from being raised inside a critical
+section, WAL write permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that will write WAL, we have added an assertion flag indicating that
+permission was checked before XLogBeginInsert() is called; if it was not,
+XLogBeginInsert() fails an assertion.  The permission check is not mandatory
+when XLogBeginInsert() is called outside a critical section, where throwing an
+error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION(); the flag is reset automatically on exit from the
+critical section.  The rules for choosing among these permission-check
+routines are:
+
+	Places where a WAL write can be expected inside a critical section without
+	holding a valid XID (e.g. VACUUM) must use CheckWALPermitted(), so that the
+	error can be reported before the critical section is entered.
+
+	Places that write INSERT or UPDATE records, which never happens without a
+	valid XID, can use AssertWALPermittedHaveXID(), so that non-assert builds
+	incur no checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and that
+	may or may not have an XID, should use AssertWALPermitted() so that
+	assert-enabled builds still verify the permission check was made.
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read-only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
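
To make the coding rule above concrete, a WAL-writing code path would follow
roughly the pattern below.  This is only a sketch: the function name and the
use of an XLOG_NOOP record are illustrative, and it assumes CheckWALPermitted()
and the related assert macros are provided by the patch's walprohibit header.

    #include "postgres.h"

    #include "access/rmgr.h"
    #include "access/walprohibit.h"
    #include "access/xloginsert.h"
    #include "catalog/pg_control.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    static void
    my_log_buffer_change(Buffer buf)
    {
        XLogRecPtr  recptr;

        /*
         * Check WAL permission before entering the critical section.  A
         * VACUUM-like path with no XID uses CheckWALPermitted() so that the
         * error is raised here, outside the critical section; paths that
         * always hold an XID would use AssertWALPermittedHaveXID(), and
         * paths unreachable in the WAL prohibited state would use
         * AssertWALPermitted().
         */
        CheckWALPermitted();

        START_CRIT_SECTION();

        /* ... apply the change to the shared buffer ... */
        MarkBufferDirty(buf);

        /* Safe to write WAL now; permission was checked above. */
        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        recptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
        PageSetLSN(BufferGetPage(buf), recptr);

        END_CRIT_SECTION();
    }
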
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

Attachment: v35-0005-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 90ad2e718350fad5723583ea1ec8fe2eb198428f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 07:40:44 -0400
Subject: [PATCH v35 05/10] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 75 +++++++++++++++++++------------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e8b1cf4ce20..15dd322d21d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -938,6 +938,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -8041,38 +8042,14 @@ StartupXLOG(void)
 	XLogReaderFree(xlogreader);
 
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory and, later, whenever WAL writes
+	 * are permitted, write an XLOG_FPW_CHANGE record before the resource
+	 * managers write cleanup WAL records or a checkpoint record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
-	 * shut down cleanly, which been through recovery.
-	 */
-	if (ControlFile->state != DB_SHUTDOWNED)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryIsRequested())
-		CleanupAfterArchiveRecovery();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
 
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8116,6 +8093,46 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(void)
+{
+	bool		promoted = false;
+
+	/* Write an XLOG_FPW_CHANGE record */
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, which been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryIsRequested())
+		CleanupAfterArchiveRecovery();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

Attachment: v35-0004-Postpone-some-end-of-recovery-operations-relatin.patch (application/x-patch)
From abb5903a32d31b5b8f5b14f3bb9ed28eaf979605 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 07:40:36 -0400
Subject: [PATCH v35 04/10] Postpone some end-of-recovery operations relating
 to allowing WAL.

A previous patch moved the code that decides whether to write a checkpoint
or an end-of-recovery record into PerformRecoveryXlogAction(), and the
post-archive-recovery cleanup code into CleanupAfterArchiveRecovery(), but
both functions were still called from the same place. Now postpone those
calls until after we clear InRecovery and shut down the XLogReader.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.

Robert Haas, with modifications by Amul Sul
---
 src/backend/access/transam/xlog.c | 42 +++++++++++++++----------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 12f7e080d3e..e8b1cf4ce20 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7989,27 +7989,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
-	 * shut down cleanly, which been through recovery.
-	 */
-	if (ControlFile->state != DB_SHUTDOWNED)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryIsRequested())
-		CleanupAfterArchiveRecovery();
-
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -8061,6 +8040,27 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, which been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryIsRequested())
+		CleanupAfterArchiveRecovery();
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

Attachment: v35-0003-Refactor-some-end-of-recovery-code-out-of-Startu.patch (application/x-patch)
From 66d34cbc409ad10c69975325701d72549fb94f66 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 07:40:07 -0400
Subject: [PATCH v35 03/10] Refactor some end-of-recovery code out of
 StartupXLOG().

Moved the code that decides whether to write a checkpoint or an
end-of-recovery record into PerformRecoveryXlogAction().

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 285 ++++++++++++++++--------------
 1 file changed, 154 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d5bf9a72c..12f7e080d3e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -892,6 +892,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -937,6 +938,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5713,6 +5715,101 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(void)
+{
+	XLogRecPtr	EndOfLog;
+
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be pre-allocated
+	 * files containing garbage. In any case, they are not part of the new
+	 * timeline's history so we don't need them.
+	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because it
+	 * was hit by a meteor), it will never make it to the archive. That's OK
+	 * from our point of view, because the new segment that we created with the
+	 * new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically present
+	 * in the new file with new TLI, but recovery won't look there when it's
+	 * recovering to the older timeline. On the other hand, if we archive the
+	 * partial segment, and the original server on that timeline is still
+	 * running and archives the completed version of the same segment later, it
+	 * will fail. (We used to do that in 9.4 and below, and it caused such
+	 * problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix, and
+	 * archive it. Archive recovery will never try to read .partial segments, so
+	 * they will normally go unused. But in the odd PITR case, the administrator
+	 * can copy them manually to the pg_wal directory (removing the suffix).
+	 * They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let it
+	 * be archived normally. (In particular, if it was restored from the archive
+	 * to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+		TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+
+	/*
+	 * Done with archive recovery request, clear the shared memory state which
+	 * no longer needed.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedArchiveRecoveryRequested = false;
+	ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7907,139 +8004,11 @@ StartupXLOG(void)
 	 * shut down cleanly, which been through recovery.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
-			PromoteIsTriggered())
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+		promoted = PerformRecoveryXLogAction();
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryIsRequested())
-	{
-		XLogRecPtr	EndOfLog;
-
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		(void) GetLastSegSwitchData(&EndOfLog);
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-			TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-
-		/*
-		 * Done with archive recovery request, clear the shared memory state
-		 * which no longer needed.
-		 */
-		SpinLockAcquire(&XLogCtl->info_lck);
-		XLogCtl->SharedArchiveRecoveryRequested = false;
-		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
-		SpinLockRelease(&XLogCtl->info_lck);
-	}
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8243,6 +8212,60 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static bool
+PerformRecoveryXLogAction(void)
+{
+	bool		promoted = false;
+
+	/*
+	 * Perform a checkpoint to update all our recovery activity to disk.
+	 *
+	 * Note that we write a shutdown checkpoint rather than an on-line one. This
+	 * is not particularly critical, but since we may be assigning a new TLI,
+	 * using a shutdown checkpoint allows us to have the rule that TLI only
+	 * changes in shutdown checkpoints, which allows some extra error checking
+	 * in xlog_redo.
+	 *
+	 * In promotion, only create a lightweight end-of-recovery record instead of
+	 * a full checkpoint. A checkpoint is requested later, after we're fully out
+	 * of recovery mode and already accepting queries.
+	 */
+	if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
+		PromoteIsTriggered())
+	{
+		promoted = true;
+
+		/*
+		 * Insert a special WAL record to mark the end of recovery, since we
+		 * aren't doing a checkpoint. That means that the checkpointer process
+		 * may likely be in the middle of a time-smoothed restartpoint and could
+		 * continue to be for minutes after this.  That sounds strange, but the
+		 * effect is roughly the same and it would be stranger to try to come
+		 * out of the restartpoint and then checkpoint. We request a checkpoint
+		 * later anyway, just for safety.
+		 */
+		CreateEndOfRecoveryRecord();
+	}
+	else
+	{
+		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+						  CHECKPOINT_IMMEDIATE |
+						  CHECKPOINT_WAIT);
+	}
+
+	return promoted;
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

v35-0002-miscellaneous-remove-dependency-on-global-and-lo.patch (application/x-patch)
From 063836dc733e7dee01196c847f0dc8cfb4302e29 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 06:43:32 -0400
Subject: [PATCH v35 02/10] miscellaneous: remove dependency on global and
 local variable.

Removes the dependency on global variables and some local variables in
the StartupXLOG() function whose values are available in, or can be
deduced from, shared memory.

These changes enable us to move some of the code from StartupXLOG() into
a separate function that can be executed by other processes that are
connected to shared memory.
---
 src/backend/access/transam/xlog.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a6ddfb1872..c9d5bf9a72c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7902,7 +7902,11 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
-	if (InRecovery)
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server wasn't
+	 * shut down cleanly, which been through recovery.
+	 */
+	if (ControlFile->state != DB_SHUTDOWNED)
 	{
 		/*
 		 * Perform a checkpoint to update all our recovery activity to disk.
@@ -7919,7 +7923,7 @@ StartupXLOG(void)
 		 * queries.
 		 */
 		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
+			PromoteIsTriggered())
 		{
 			promoted = true;
 
@@ -7945,6 +7949,8 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryIsRequested())
 	{
+		XLogRecPtr	EndOfLog;
+
 		/*
 		 * And finally, execute the recovery_end_command, if any.
 		 */
@@ -7962,6 +7968,7 @@ StartupXLOG(void)
 		 * pre-allocated files containing garbage. In any case, they are not
 		 * part of the new timeline's history so we don't need them.
 		 */
+		(void) GetLastSegSwitchData(&EndOfLog);
 		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
 
 		/*
@@ -7998,6 +8005,7 @@ StartupXLOG(void)
 		{
 			char		origfname[MAXFNAMELEN];
 			XLogSegNo	endLogSegNo;
+			TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
 			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-- 
2.18.0

v35-0001-Store-ArchiveRecoveryRequested-in-shared-memory-.patch (application/x-patch)
From 16c5fced3c4d456e5f51e7e84a153770059357a1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 20 Sep 2021 05:52:19 -0400
Subject: [PATCH v35 01/10] Store ArchiveRecoveryRequested in shared memory and
 change its scope.

Storing the ArchiveRecoveryRequested value in shared memory makes it
accessible to other processes as well. As of now no other process cares
about it, but this will help to move code that is executed when
ArchiveRecoveryRequested is set into other processes.

Also, the patch changes the scope of the ArchiveRecoveryRequested global
to file-local, and its type to integer. It now has three values: -1 for
unknown, 1 when the request has been made, and 0 when there is no
request or the request has been completed.
---
 src/backend/access/transam/timeline.c    |   6 +-
 src/backend/access/transam/xlog.c        | 122 ++++++++++++++++-------
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/include/access/xlog.h                |   1 +
 src/include/access/xlog_internal.h       |   1 -
 5 files changed, 93 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c1756..0d4951b8255 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -93,7 +93,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 		return list_make1(entry);
 	}
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		TLHistoryFileName(histfname, targetTLI);
 		fromArchive =
@@ -229,7 +229,7 @@ existsTimeLineHistory(TimeLineID probeTLI)
 	if (probeTLI == 1)
 		return false;
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		TLHistoryFileName(histfname, probeTLI);
 		RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
@@ -331,7 +331,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	/*
 	 * If a history file exists for the parent, copy it verbatim
 	 */
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		TLHistoryFileName(histfname, parentTLI);
 		RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749da..4a6ddfb1872 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -239,17 +239,23 @@ static bool LocalPromoteIsTriggered = false;
 static int	LocalXLogInsertAllowed = -1;
 
 /*
- * When ArchiveRecoveryRequested is set, archive recovery was requested,
- * ie. signal files were present. When InArchiveRecovery is set, we are
- * currently recovering using offline XLOG archives. These variables are only
- * valid in the startup process.
- *
- * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
- * currently performing crash recovery using only XLOG files in pg_wal, but
- * will switch to using offline XLOG archives as soon as we reach the end of
- * WAL in pg_wal.
-*/
-bool		ArchiveRecoveryRequested = false;
+ * When ArchiveRecoveryRequested is ARCHIVE_RECOVERY_REQUEST_YES, archive
+ * recovery was requested, ie. signal files were present. When InArchiveRecovery
+ * is set, we are currently recovering using offline XLOG archives.
+ *
+ * When ArchiveRecoveryRequested is ARCHIVE_RECOVERY_REQUEST_YES, but
+ * InArchiveRecovery is false, we're currently performing crash recovery using
+ * only XLOG files in pg_wal, but will switch to using offline XLOG archives as
+ * soon as we reach the end of WAL in pg_wal.
+ *
+ * InArchiveRecovery only valid in the startup process. ArchiveRecoveryRequested
+ * can be acccsed through ArchiveRecoveryIsRequested().
+ */
+#define ARCHIVE_RECOVERY_REQUEST_UNKOWN		-1
+#define ARCHIVE_RECOVERY_REQUEST_NO			0
+#define ARCHIVE_RECOVERY_REQUEST_YES		1
+
+static int	ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
 bool		InArchiveRecovery = false;
 
 static bool standby_signal_file_found = false;
@@ -637,6 +643,12 @@ typedef struct XLogCtlData
 	 */
 	RecoveryState SharedRecoveryState;
 
+	/*
+	 * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+	 * requested. Protected by info_lck.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
 	 * run.  Protected by info_lck.
@@ -4455,7 +4467,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
+			if (!InArchiveRecovery && ArchiveRecoveryIsRequested() &&
 				!fetching_ckpt)
 			{
 				ereport(DEBUG1,
@@ -5223,6 +5235,7 @@ XLOGShmemInit(void)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
@@ -5485,16 +5498,16 @@ readRecoverySignalFile(void)
 	}
 
 	StandbyModeRequested = false;
-	ArchiveRecoveryRequested = false;
+	ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_NO;
 	if (standby_signal_file_found)
 	{
 		StandbyModeRequested = true;
-		ArchiveRecoveryRequested = true;
+		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_YES;
 	}
 	else if (recovery_signal_file_found)
 	{
 		StandbyModeRequested = false;
-		ArchiveRecoveryRequested = true;
+		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_YES;
 	}
 	else
 		return;
@@ -5507,12 +5520,18 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.  A lock is not
+	 * needed since we are the only ones who updating this.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = (bool) ArchiveRecoveryRequested;
 }
 
 static void
 validateRecoveryParameters(void)
 {
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return;
 
 	/*
@@ -5750,7 +5769,7 @@ recoveryStopsBefore(XLogReaderState *record)
 	 * Ignore recovery target settings when not in archive recovery (meaning
 	 * we are in crash recovery).
 	 */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return false;
 
 	/* Check if we should stop as soon as reaching consistency */
@@ -5897,7 +5916,7 @@ recoveryStopsAfter(XLogReaderState *record)
 	 * Ignore recovery target settings when not in archive recovery (meaning
 	 * we are in crash recovery).
 	 */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return false;
 
 	info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
@@ -6211,7 +6230,7 @@ recoveryApplyDelay(XLogReaderState *record)
 		return false;
 
 	/* nothing to do if crash recovery is requested */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		return false;
 
 	/*
@@ -6455,7 +6474,7 @@ CheckRequiredParameterValues(void)
 	 * For archive recovery, the WAL must be generated with at least 'replica'
 	 * wal_level.
 	 */
-	if (ArchiveRecoveryRequested && ControlFile->wal_level == WAL_LEVEL_MINIMAL)
+	if (ArchiveRecoveryIsRequested() && ControlFile->wal_level == WAL_LEVEL_MINIMAL)
 	{
 		ereport(FATAL,
 				(errmsg("WAL was generated with wal_level=minimal, cannot continue recovering"),
@@ -6467,7 +6486,7 @@ CheckRequiredParameterValues(void)
 	 * For Hot Standby, the WAL must be generated with 'replica' mode, and we
 	 * must have at least as many backend slots as the primary.
 	 */
-	if (ArchiveRecoveryRequested && EnableHotStandby)
+	if (ArchiveRecoveryIsRequested() && EnableHotStandby)
 	{
 		/* We ignore autovacuum_max_workers when we make this test. */
 		RecoveryRequiresIntParameter("max_connections",
@@ -6633,7 +6652,7 @@ StartupXLOG(void)
 	readRecoverySignalFile();
 	validateRecoveryParameters();
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		if (StandbyModeRequested)
 			ereport(LOG,
@@ -6666,7 +6685,7 @@ StartupXLOG(void)
 	 * Take ownership of the wakeup latch if we're going to sleep during
 	 * recovery.
 	 */
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/* Set up XLOG reader facility */
@@ -6833,7 +6852,7 @@ StartupXLOG(void)
 		 * to minRecoveryPoint, up to backupEndPoint, or until we see an
 		 * end-of-backup record), and we can enter archive recovery directly.
 		 */
-		if (ArchiveRecoveryRequested &&
+		if (ArchiveRecoveryIsRequested() &&
 			(ControlFile->minRecoveryPoint != InvalidXLogRecPtr ||
 			 ControlFile->backupEndRequired ||
 			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
@@ -7063,7 +7082,7 @@ StartupXLOG(void)
 	}
 	else if (ControlFile->state != DB_SHUTDOWNED)
 		InRecovery = true;
-	else if (ArchiveRecoveryRequested)
+	else if (ArchiveRecoveryIsRequested())
 	{
 		/* force recovery due to presence of recovery signal file */
 		InRecovery = true;
@@ -7229,7 +7248,7 @@ StartupXLOG(void)
 		 * control file and we've established a recovery snapshot from a
 		 * running-xacts WAL record.
 		 */
-		if (ArchiveRecoveryRequested && EnableHotStandby)
+		if (ArchiveRecoveryIsRequested() && EnableHotStandby)
 		{
 			TransactionId *xids;
 			int			nxids;
@@ -7646,7 +7665,7 @@ StartupXLOG(void)
 		 * This check is intentionally after the above log messages that
 		 * indicate how far recovery went.
 		 */
-		if (ArchiveRecoveryRequested &&
+		if (ArchiveRecoveryIsRequested() &&
 			recoveryTarget != RECOVERY_TARGET_UNSET &&
 			!reachedRecoveryTarget)
 			ereport(FATAL,
@@ -7674,7 +7693,7 @@ StartupXLOG(void)
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 		DisownLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/*
@@ -7725,7 +7744,7 @@ StartupXLOG(void)
 		 * crashes while an online backup is in progress. We must not treat
 		 * that as an error, or the database will refuse to start up.
 		 */
-		if (ArchiveRecoveryRequested || ControlFile->backupEndRequired)
+		if (ArchiveRecoveryIsRequested() || ControlFile->backupEndRequired)
 		{
 			if (ControlFile->backupEndRequired)
 				ereport(FATAL,
@@ -7771,7 +7790,7 @@ StartupXLOG(void)
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
 	PrevTimeLineID = ThisTimeLineID;
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		char	   *reason;
 		char		recoveryPath[MAXPGPATH];
@@ -7899,7 +7918,7 @@ StartupXLOG(void)
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
 			LocalPromoteIsTriggered)
 		{
 			promoted = true;
@@ -7924,7 +7943,7 @@ StartupXLOG(void)
 		}
 	}
 
-	if (ArchiveRecoveryRequested)
+	if (ArchiveRecoveryIsRequested())
 	{
 		/*
 		 * And finally, execute the recovery_end_command, if any.
@@ -8003,6 +8022,15 @@ StartupXLOG(void)
 				XLogArchiveNotify(partialfname);
 			}
 		}
+
+		/*
+		 * Done with archive recovery request, clear the shared memory state
+		 * which no longer needed.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		XLogCtl->SharedArchiveRecoveryRequested = false;
+		ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
 	/*
@@ -8263,6 +8291,32 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Is the archive recovery is requested?
+ *
+ * If ArchiveRecoveryRequested is unknown, then it will be updated by checking
+ * shared memory. Like PromoteIsTriggered(), this works in any process that's
+ * connected to shared memory.
+ */
+bool
+ArchiveRecoveryIsRequested(void)
+{
+	/*
+	 * If not UNKNOWN, the ArchiveRecoveryRequested value either
+	 * ARCHIVE_RECOVERY_REQUEST_YES => 1 or ARCHIVE_RECOVERY_REQUEST_NO => 0
+	 * which can be coerced to boolean true or false respectively.
+	 */
+	if (likely(ArchiveRecoveryRequested != ARCHIVE_RECOVERY_REQUEST_UNKOWN))
+		return (bool) ArchiveRecoveryRequested;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ArchiveRecoveryRequested = XLogCtl->SharedArchiveRecoveryRequested ?
+		ARCHIVE_RECOVERY_REQUEST_YES : ARCHIVE_RECOVERY_REQUEST_NO;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return (bool) ArchiveRecoveryRequested;
+}
+
 /*
  * Returns current recovery state from shared memory.
  *
@@ -10174,7 +10228,7 @@ xlog_redo(XLogReaderState *record)
 		 * record, the backup was canceled and the end-of-backup record will
 		 * never arrive.
 		 */
-		if (ArchiveRecoveryRequested &&
+		if (ArchiveRecoveryIsRequested() &&
 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
 			ereport(PANIC,
@@ -12176,7 +12230,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster)
+		if (ArchiveRecoveryIsRequested() && IsUnderPostmaster)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e754b..756d03adb6f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -67,7 +67,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	 * Ignore restore_command when not in archive recovery (meaning we are in
 	 * crash recovery).
 	 */
-	if (!ArchiveRecoveryRequested)
+	if (!ArchiveRecoveryIsRequested())
 		goto not_available;
 
 	/* In standby mode, restore_command might not be supplied */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5e2c94a05ff..8ea4e583980 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -289,6 +289,7 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern bool ArchiveRecoveryIsRequested(void);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3b5eceff658..2051953d404 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -319,7 +319,6 @@ extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
  * Exported for the functions in timeline.c and xlogarchive.c.  Only valid
  * in the startup process.
  */
-extern bool ArchiveRecoveryRequested;
 extern bool InArchiveRecovery;
 extern bool StandbyMode;
 extern char *recoveryRestoreCommand;
-- 
2.18.0

#164Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#158)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 15, 2021 at 4:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 16, 2020, at 6:55 AM, amul sul <sulamul@gmail.com> wrote:

(2) if the session is idle, we also need the top-level abort
record to be written immediately, but can't send an error to the client until the next
command is issued without losing wire protocol synchronization. For now, we just use
FATAL to kill the session; maybe this can be improved in the future.

Andres,

I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned page held by the cursor, when another session disables WAL. It would be very interesting to test how the vacuum handles that specific change. I have not figured out the cleanest way to do this, though, as we don't as a project yet have a standard way of setting up xid exhaustion in a regression test, do we? The closest I saw to it was your work in [1], but that doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which isn't useable from inside an isolation test. Do you have a suggestion how best to continue developing out the test infrastructure?

Amul,

The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the isolation tester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test to abort and all subsequent permutations to not run. Attached patch v34-0009 fixes that.

Interesting.

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.

Can't we do the same in the TAP test? If the intention is only to test
session termination when the system changes to the WAL-prohibited state,
then I have added that in the latest version, but that test does not
re-initiate the same connection again; I don't think that is possible
there either.

Regards,
Amul

#165Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#164)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 22, 2021, at 6:14 AM, Amul Sul <sulamul@gmail.com> wrote:

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.

Can't we do the same in the TAP test? If the intention is only to test
session termination when the system changes to the WAL-prohibited state,
then I have added that in the latest version, but that test does not
re-initiate the same connection again; I don't think that is possible
there either.

Perhaps you can point me to a TAP test that does this in a concise fashion. When I tried writing a TAP test for this, it was much longer than the equivalent isolation test spec.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#166Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#165)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 22, 2021 at 6:59 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Sep 22, 2021, at 6:14 AM, Amul Sul <sulamul@gmail.com> wrote:

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.

Can't we do the same in the TAP test? If the intention is only to test
session termination when the system changes to the WAL-prohibited state,
then I have added that in the latest version, but that test does not
re-initiate the same connection again; I don't think that is possible
there either.

Perhaps you can point me to a TAP test that does this in a concise fashion. When I tried writing a TAP test for this, it was much longer than the equivalent isolation test spec.

Yes, that is a bit longer, here is the snip from v35-0010 patch:

+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node_primary->connstr('postgres')
+ ],
+ '<',
+ \$mysession_stdin,
+ '>',
+ \$mysession_stdout,
+ '2>',
+ \$mysession_stderr,
+ $psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+ 'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+ 'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+ 'session with open write transaction is terminated');

Regards,
Amul

#167Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#166)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sep 22, 2021, at 6:39 AM, Amul Sul <sulamul@gmail.com> wrote:

Yes, that is a bit longer, here is the snip from v35-0010 patch

Right, that's longer, and only tests one interaction. The isolation spec I posted upthread tests multiple interactions between the session which uses cursors and the system going read-only. Whether the session using a cursor gets a FATAL, just an ERROR, or neither depends on where it is in the process of opening, using, closing and committing. I think that's interesting. If the implementation of the ALTER SYSTEM READ ONLY feature were to change in such a way as, for example, to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#168Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#167)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Sep 22, 2021 at 7:33 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Sep 22, 2021, at 6:39 AM, Amul Sul <sulamul@gmail.com> wrote:

Yes, that is a bit longer, here is the snip from v35-0010 patch

Right, that's longer, and only tests one interaction. The isolation spec I posted upthread tests multiple interactions between the session which uses cursors and the system going read-only. Whether the session using a cursor gets a FATAL, just an ERROR, or neither depends on where it is in the process of opening, using, closing and committing. I think that's interesting. If the implementation of the ALTER SYSTEM READ ONLY feature were to change in such a way as, for example, to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output.

Agreed.

Regards,
Amul

#169Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#162)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:

Ok, understood, I have separated my changes into 0001 and 0002 patch,
and the refactoring patches start from 0003.

I think it would be better in the other order, with the refactoring
patches at the beginning of the series.

In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
memory as said previously. Coping ArchiveRecoveryRequested value to
shared memory is not really interesting, and I think somehow we should
reuse existing variable, (perhaps, with some modification of the
information it can store, e.g. adding one more enum value for
SharedRecoveryState or something else), thinking on the same.

In addition to that, I tried to turn down the scope of
ArchiveRecoveryRequested global variable. Now, this is a static
variable, and the scope is limited to xlog.c file like
LocalXLogInsertAllowed and can be accessed through the newly added
function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
me know what you think about the approach.

I'm not sure yet whether I like this or not, but it doesn't seem like
a terrible idea. You spelled UNKNOWN wrong, though, which does seem
like a terrible idea. :-) "acccsed" is not correct either.

Also, the new comments for ArchiveRecoveryRequested /
ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is
substitute the new terminology into the existing comment, but that
means that the purpose of the new "unknown" value is not at all clear.

Consider the following two patch fragments:

+ * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+ * requested. Protected by info_lck.
...
+ * Remember archive recovery request in shared memory state.  A lock is not
+ * needed since we are the only ones who updating this.

These two comments directly contradict each other.

+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->SharedArchiveRecoveryRequested = false;
+ ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+ SpinLockRelease(&XLogCtl->info_lck);

This seems odd to me. In the first place, there doesn't seem to be any
value in clearing this -- we're just expending extra CPU cycles to get
rid of a value that wouldn't be used anyway. In the second place, if
somehow someone checked the value after this point, with this code,
they might get the wrong answer, whereas if you just deleted this,
they would get the right answer.

In 0002 patch is a mixed one where I tried to remove the dependencies
on global variables and local variables belonging to StartupXLOG(). I
am still worried about the InRecovery value that needs to be deduced
afterward inside XLogAcceptWrites(). Currently, relying on
ControlFile->state != DB_SHUTDOWNED check but I think that will not be
good for ASRO where we plan to skip XLogAcceptWrites() work only and
let the StartupXLOG() do the rest of the work as it is where it will
going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
might need some ugly kludge to call PerformRecoveryXLogAction() in
checkpointer irrespective of DBState, which makes me a bit
uncomfortable.

I think that replacing the if (InRecovery) test with if
(ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
mean, there are three separate places where we set InRecovery = true.
One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
what you have here, but it also can happen if checkPoint.redo <
RecPtr, or if read_backup_label is true and ReadCheckpointRecord
returns non-NULL. Now maybe you're going to tell me that in those
other two cases we can't reach here anyway, but I don't see off-hand
why that should be true, and even if it is true, it seems like kind of
a fragile thing to rely on. I think we need to rely on something in
shared memory that is more explicitly connected to the question of
whether we are in recovery.

The other part of this patch has to do with whether we can use the
return value of GetLastSegSwitchData as a substitute for relying on
EndOfLog. Now as you have it, you end up creating a local variable
called EndOfLog that shadows another such variable in an outer scope,
which probably would not make anyone who noticed things in such a
state very happy. However, that will naturally get fixed if you
reorder the patches as per above, so let's turn to the central
question: is this a good way of getting EndOfLog? The value that would
be in effect at the time this code is executed is set here:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false);
EndOfLog = EndRecPtr;

Subsequently we do this:

/* start the archive_timeout timer and LSN running */
XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
XLogCtl->lastSegSwitchLSN = EndOfLog;

So at that point the value that GetLastSegSwitchData() would return
has to match what's in the existing variable. But later XLogWrite()
will change the value. So the question boils down to whether
XLogWrite() could have been called between the assignment just above
and when this code runs. And the answer seems to pretty clear be yes,
because just above this code, we might have done
CreateEndOfRecoveryRecord() or RequestCheckpoint(), and just above
that, we did UpdateFullPageWrites(). So I don't think this is right.

(3) CheckpointStats, which is called from RemoveXlogFile which is
called from RemoveNonParentXlogFiles which is called from
CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
This last one is actually pretty weird already in the existing code.
It sort of looks like RemoveXlogFile() only expects to be called from
the checkpointer (or a standalone backend) so that it can update
CheckpointStats and have that just work, but actually it's also called
from the startup process when a timeline switch happens. I don't know
whether the fact that the increments to ckpt_segs_recycled get lost in
that case should be considered an intentional behavior that should be
preserved or an inadvertent mistake.

Maybe I could be wrong, but I think that is intentional. It removes
pre-allocated or bogus files of the old timeline which are not
supposed to be considered in stats. The comments for
CheckpointStatsData might not be clear but comment at the calling
RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:

Sure, I'm not saying the files are being removed by accident. I'm
saying it may be accidental that the removals are (I think) not going
to make it into the stats.

--
Robert Haas
EDB: http://www.enterprisedb.com

#170Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#169)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Sep 23, 2021 at 11:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:

Ok, understood, I have separated my changes into 0001 and 0002 patch,
and the refactoring patches start from 0003.

I think it would be better in the other order, with the refactoring
patches at the beginning of the series.

Ok, will do that. I did it the other way to minimize the diff, e.g. the
deletion diff for the RecoveryXlogAction enum,
DetermineRecoveryXlogAction(), etc.

In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
memory as said previously. Coping ArchiveRecoveryRequested value to
shared memory is not really interesting, and I think somehow we should
reuse existing variable, (perhaps, with some modification of the
information it can store, e.g. adding one more enum value for
SharedRecoveryState or something else), thinking on the same.

In addition to that, I tried to turn down the scope of
ArchiveRecoveryRequested global variable. Now, this is a static
variable, and the scope is limited to xlog.c file like
LocalXLogInsertAllowed and can be accessed through the newly added
function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
me know what you think about the approach.

I'm not sure yet whether I like this or not, but it doesn't seem like
a terrible idea. You spelled UNKNOWN wrong, though, which does seem
like a terrible idea. :-) "acccsed" is not correct either.

Also, the new comments for ArchiveRecoveryRequested /
ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is
substitute the new terminology into the existing comment, but that
means that the purpose of the new "unknown" value is not at all clear.

Ok, will fix those typos and try to improve the comments.

Consider the following two patch fragments:

+ * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+ * requested. Protected by info_lck.
...
+ * Remember archive recovery request in shared memory state.  A lock is not
+ * needed since we are the only ones who updating this.

These two comments directly contradict each other.

Okay, the first comment is not clear enough; I will fix that too. I
meant that we don't need the lock there, since we are the only ones
updating the flag at that point.

+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->SharedArchiveRecoveryRequested = false;
+ ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+ SpinLockRelease(&XLogCtl->info_lck);

This seems odd to me. In the first place, there doesn't seem to be any
value in clearing this -- we're just expending extra CPU cycles to get
rid of a value that wouldn't be used anyway. In the second place, if
somehow someone checked the value after this point, with this code,
they might get the wrong answer, whereas if you just deleted this,
they would get the right answer.

Previously, this flag was only valid in the startup process. But now it
will be valid for all processes and will persist until the whole server
gets restarted. I don't want anybody to use this flag after the cleanup
point, and to make that explicit I am clearing it here.

By the way, I don't necessarily expect us to go with this approach. I
proposed it by analogy with the PromoteIsTriggered() implementation,
but IMO it would be better to have something else, since we only want to
perform the archive cleanup operation, and most of the work related to
archive recovery is already done inside StartupXLOG().

Rather than the proposed design, I was thinking of adding one or two
more RecoveryState enum values, and, when skipping XLogAcceptWrites(),
setting XLogCtl->SharedRecoveryState appropriately, so that we can
easily identify that archive recovery was requested previously and that
we now need to perform its pending cleanup operation. Thoughts?
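
For illustration, a minimal sketch of the kind of enum addition meant
here, assuming the RecoveryState definition currently in xlog.h; the
added value and its name below are hypothetical, not from any posted
patch:

typedef enum RecoveryState
{
	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
	RECOVERY_STATE_DONE,		/* currently in production */

	/*
	 * Hypothetical addition: replay has finished, but the post-recovery
	 * cleanup (the XLogAcceptWrites() work) is still pending and will
	 * be performed later by whichever process runs XLogAcceptWrites().
	 */
	RECOVERY_STATE_ARCHIVE_CLEANUP_PENDING
} RecoveryState;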

In 0002 patch is a mixed one where I tried to remove the dependencies
on global variables and local variables belonging to StartupXLOG(). I
am still worried about the InRecovery value that needs to be deduced
afterward inside XLogAcceptWrites(). Currently, relying on
ControlFile->state != DB_SHUTDOWNED check but I think that will not be
good for ASRO where we plan to skip XLogAcceptWrites() work only and
let the StartupXLOG() do the rest of the work as it is where it will
going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
might need some ugly kludge to call PerformRecoveryXLogAction() in
checkpointer irrespective of DBState, which makes me a bit
uncomfortable.

I think that replacing the if (InRecovery) test with if
(ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
mean, there are three separate places where we set InRecovery = true.
One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
what you have here, but it also can happen if checkPoint.redo <
RecPtr, or if read_backup_label is true and ReadCheckpointRecord
returns non-NULL. Now maybe you're going to tell me that in those
other two cases we can't reach here anyway, but I don't see off-hand
why that should be true, and even if it is true, it seems like kind of
a fragile thing to rely on. I think we need to rely on something in
shared memory that is more explicitly connected to the question of
whether we are in recovery.

No, it's the other way around. I didn't pick the (ControlFile->state !=
DB_SHUTDOWNED) condition because it sets InRecovery; rather, I picked it
because the InRecovery flag causes ControlFile->state to be set to
either DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, see the next
if (InRecovery) block after the InRecovery flag gets set. It is certain
that when the system is InRecovery, the DBState will be something other
than DB_SHUTDOWNED. But that isn't a clean approach for me, because when
the system is WAL-prohibited the DBState will be DB_IN_PRODUCTION, which
will not work, as I mentioned previously.

I too am thinking about passing this information via shared memory, but
I am trying to avoid that somehow; let's see.

The other part of this patch has to do with whether we can use the
return value of GetLastSegSwitchData as a substitute for relying on
EndOfLog. Now as you have it, you end up creating a local variable
called EndOfLog that shadows another such variable in an outer scope,
which probably would not make anyone who noticed things in such a
state very happy. However, that will naturally get fixed if you
reorder the patches as per above, so let's turn to the central
question: is this a good way of getting EndOfLog? The value that would
be in effect at the time this code is executed is set here:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false);
EndOfLog = EndRecPtr;

Subsequently we do this:

/* start the archive_timeout timer and LSN running */
XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
XLogCtl->lastSegSwitchLSN = EndOfLog;

So at that point the value that GetLastSegSwitchData() would return
has to match what's in the existing variable. But later XLogWrite()
will change the value. So the question boils down to whether
XLogWrite() could have been called between the assignment just above
and when this code runs. And the answer seems to pretty clear be yes,
because just above this code, we might have done
CreateEndOfRecoveryRecord() or RequestCheckpoint(), and just above
that, we did UpdateFullPageWrites(). So I don't think this is right.

You are correct that if XLogWrite() is called in between, the
lastSegSwitchLSN value can change, but the question is whether it will
change in our case. I think it won't; let me explain.

IIUC, lastSegSwitchLSN generally changes in XLogWrite() only when the
previous WAL segment has been filled up. But look closely at what will
be written before we check lastSegSwitchLSN: currently we have a record
for full-page writes and a record for either end-of-recovery or a
checkpoint. All of these are of fixed size, and I don't think they are
going to fill the whole 16MB WAL segment. Correct me if I am missing
something.

(3) CheckpointStats, which is called from RemoveXlogFile which is
called from RemoveNonParentXlogFiles which is called from
CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
This last one is actually pretty weird already in the existing code.
It sort of looks like RemoveXlogFile() only expects to be called from
the checkpointer (or a standalone backend) so that it can update
CheckpointStats and have that just work, but actually it's also called
from the startup process when a timeline switch happens. I don't know
whether the fact that the increments to ckpt_segs_recycled get lost in
that case should be considered an intentional behavior that should be
preserved or an inadvertent mistake.

Maybe I could be wrong, but I think that is intentional. It removes
pre-allocated or bogus files of the old timeline which are not
supposed to be considered in stats. The comments for
CheckpointStatsData might not be clear but comment at the calling
RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:

Sure, I'm not saying the files are being removed by accident. I'm
saying it may be accidental that the removals are (I think) not going
to make it into the stats.

Understood; it looks like I missed the concluding line in the previous
reply. My point was that if we are deleting bogus files, why should we
care about counting them in the stats?

Regards,
Amul

#171Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#170)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Sep 24, 2021 at 5:07 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Sep 23, 2021 at 11:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:

Ok, understood, I have separated my changes into 0001 and 0002 patch,
and the refactoring patches start from 0003.

I think it would be better in the other order, with the refactoring
patches at the beginning of the series.

Ok, will do that. I did it the other way to minimize the diff, e.g. the
deletion diff for the RecoveryXlogAction enum,
DetermineRecoveryXlogAction(), etc.

I have reversed the patch order. Now the refactoring patches come
first, and the patch that removes the dependencies on global and local
variables comes last. I made the necessary modifications in the
refactoring patches too, e.g. removed DetermineRecoveryXlogAction() and
the RecoveryXlogAction enum, which are no longer needed (thanks to
commit 1d919de5eb3fffa7cc9479ed6d2915fb89794459, which simplified the
code).

To find the value of InRecovery after we clear it, the patch still uses
ControlFile's DBState, but now the check condition has been changed to a
more specific one, which is less confusing.

In casual off-list discussion, the suggestion was made to check
SharedRecoveryState to find out the InRecovery value afterward, using
RecoveryInProgress(). But we can't depend on SharedRecoveryState,
because at startup it gets initialized to RECOVERY_STATE_CRASH
irrespective of whether InRecovery is set later. Therefore, we can't
use RecoveryInProgress(), which always returns true if
SharedRecoveryState != RECOVERY_STATE_DONE.
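
For reference, the check in question is roughly of this shape (a
simplified paraphrase of the existing logic, not a verbatim quote of
xlog.c):

bool
RecoveryInProgress(void)
{
	/*
	 * Anything other than RECOVERY_STATE_DONE counts as "in recovery",
	 * so this cannot tell crash recovery apart from a pending archive
	 * recovery request.
	 */
	return XLogCtl->SharedRecoveryState != RECOVERY_STATE_DONE;
}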

I am posting only refactoring patches for now.

Regards,
Amul

Attachments:

v36-0004-Remove-dependencies-on-startup-process-specifica.patch (application/x-patch)
From 730e8331fefc882b4cab7112adf0f4d8da1ea831 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v36 4/4] Remove dependencies on startup-process-specific
 variables.

To make XLogAcceptWrites() callable elsewhere, we need to remove its
dependency on a few global and local variables that are specific to the
startup process.

The global variables are ArchiveRecoveryRequested and
LocalPromoteIsTriggered. LocalPromoteIsTriggered can already be accessed
from any other process using the existing PromoteIsTriggered();
ArchiveRecoveryRequested is made accessible by copying it into shared
memory.

XLogAcceptWrites() accepts two arguments, EndOfLogTLI and EndOfLog,
which are local to StartupXLOG(). Instead of passing them as arguments,
XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from shared memory
can be used as replacements for EndOfLogTLI and EndOfLog respectively.
XLogCtl->lastSegSwitchLSN is not going to change before we use it: it
changes only when the current WAL segment gets full, which will never
happen here, for two reasons. First, WAL writes are disabled for other
processes until XLogAcceptWrites() finishes; second, before
lastSegSwitchLSN is used, XLogAcceptWrites() writes only fixed-size WAL
records, namely a full-page-writes record and a record for either
end-of-recovery or a checkpoint, which are not going to fill up the
16MB WAL segment.

EndOfLogTLI in StartupXLOG() is the timeline ID of the last record that
xlogreader reads, but that xlogreader was simply re-fetching the last
record we had already replayed in the redo loop, if the server was in
recovery. If it was not in recovery, we don't need to worry, since this
value is needed only when ArchiveRecoveryRequested is true, which
implicitly forces redo and sets the XLogCtl->replayEndTLI value.
---
 src/backend/access/transam/xlog.c | 36 ++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91cdd7d9ff2..5b4e5ac379f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -659,6 +659,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be share which is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -880,8 +887,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -927,7 +933,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -5230,6 +5236,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
@@ -5511,6 +5518,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5702,8 +5714,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+CleanupAfterArchiveRecovery(void)
 {
+	XLogRecPtr	EndOfLog;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5720,6 +5734,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
 	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
 
 	/*
@@ -5754,6 +5769,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	{
 		char		origfname[MAXFNAMELEN];
 		XLogSegNo	endLogSegNo;
+		TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
 		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
@@ -8023,7 +8039,7 @@ StartupXLOG(void)
  	Insert->fullPageWrites = lastFullPageWrites;
 
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * If there were cascading standby servers connected to us, nudge any wal
@@ -8045,7 +8061,7 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
 
@@ -8063,8 +8079,8 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 		promoted = PerformRecoveryXLogAction();
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+	if (XLogCtl->SharedArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -8232,8 +8248,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/x-patch; name=v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From 76587e09ce6b7811ff940e2e65051cb49e7c16e6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 15:37:53 -0400
Subject: [PATCH v36 3/4] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 52 ++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 93849d8f29a..91cdd7d9ff2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -927,6 +927,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -8014,7 +8015,41 @@ StartupXLOG(void)
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+ 	/*
+	 * Update full_page_writes in shared memory, and later whenever wal write
+	 * permitted, write an XLOG_FPW_CHANGE record before resource manager
+	 * writes cleanup WAL records or checkpoint record is written.
+ 	 */
+ 	Insert->fullPageWrites = lastFullPageWrites;
+
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If there were cascading standby servers connected to us, nudge any wal
+	 * sender processes to notice that we've been promoted.
+	 */
+	WalSndWakeup();
+
+	/*
+	 * If this was a promotion, request an (online) checkpoint now. This isn't
+	 * required for consistency, but the last restartpoint might be far back,
+	 * and in case of a crash, recovering from it might take a longer than is
+	 * appropriate now that we're not in standby mode anymore.
+	 */
+	if (promoted)
+		RequestCheckpoint(CHECKPOINT_FORCE);
+}
+
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	bool		promoted = false;
+
+	/* Write an XLOG_FPW_CHANGE record */
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -8070,20 +8105,7 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
-	/*
-	 * If there were cascading standby servers connected to us, nudge any wal
-	 * sender processes to notice that we've been promoted.
-	 */
-	WalSndWakeup();
-
-	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
-	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	return promoted;
 }
 
 /*
-- 
2.18.0

v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patchapplication/x-patch; name=v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patchDownload
From 1a14516bfca72febbc3e70f7d25398c0f074c3d8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v36 1/4] Refactor some end-of-recovery code out of
 StartupXLOG().

Move the code that decides whether to write a checkpoint or an
end-of-recovery record into a new function, PerformRecoveryXLogAction().

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 261 ++++++++++++++++--------------
 1 file changed, 143 insertions(+), 118 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749da..397f7d486a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -880,6 +880,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -925,6 +927,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5694,6 +5697,88 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be pre-allocated
+	 * files containing garbage. In any case, they are not part of the new
+	 * timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because it
+	 * was hit by a meteor), it will never make it to the archive. That's OK
+	 * from our point of view, because the new segment that we created with the
+	 * new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically present
+	 * in the new file with new TLI, but recovery won't look there when it's
+	 * recovering to the older timeline. On the other hand, if we archive the
+	 * partial segment, and the original server on that timeline is still
+	 * running and archives the completed version of the same segment later, it
+	 * will fail. (We used to do that in 9.4 and below, and it caused such
+	 * problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix, and
+	 * archive it. Archive recovery will never try to read .partial segments, so
+	 * they will normally go unused. But in the odd PITR case, the administrator
+	 * can copy them manually to the pg_wal directory (removing the suffix).
+	 * They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let it
+	 * be archived normally. (In particular, if it was restored from the archive
+	 * to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7883,127 +7968,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+		promoted = PerformRecoveryXLogAction();
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8207,6 +8178,60 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static bool
+PerformRecoveryXLogAction(void)
+{
+	bool		promoted = false;
+
+	/*
+	 * Perform a checkpoint to update all our recovery activity to disk.
+	 *
+	 * Note that we write a shutdown checkpoint rather than an on-line one. This
+	 * is not particularly critical, but since we may be assigning a new TLI,
+	 * using a shutdown checkpoint allows us to have the rule that TLI only
+	 * changes in shutdown checkpoints, which allows some extra error checking
+	 * in xlog_redo.
+	 *
+	 * In promotion, only create a lightweight end-of-recovery record instead of
+	 * a full checkpoint. A checkpoint is requested later, after we're fully out
+	 * of recovery mode and already accepting queries.
+	 */
+	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		LocalPromoteIsTriggered)
+	{
+		promoted = true;
+
+		/*
+		 * Insert a special WAL record to mark the end of recovery, since we
+		 * aren't doing a checkpoint. That means that the checkpointer process
+		 * may likely be in the middle of a time-smoothed restartpoint and could
+		 * continue to be for minutes after this.  That sounds strange, but the
+		 * effect is roughly the same and it would be stranger to try to come
+		 * out of the restartpoint and then checkpoint. We request a checkpoint
+		 * later anyway, just for safety.
+		 */
+		CreateEndOfRecoveryRecord();
+	}
+	else
+	{
+		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+						  CHECKPOINT_IMMEDIATE |
+						  CHECKPOINT_WAIT);
+	}
+
+	return promoted;
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

v36-0002-Postpone-some-end-of-recovery-operations-relatin.patchapplication/x-patch; name=v36-0002-Postpone-some-end-of-recovery-operations-relatin.patchDownload
From bad3d0db320f68b083311578b1b17ff8cd1714c6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH v36 2/4] Postpone some end-of-recovery operations relating to
 allowing WAL.

A previous commit moved the code that decides whether to write a
checkpoint or an end-of-recovery record into PerformRecoveryXLogAction(),
and the post-archive-recovery code into CleanupAfterArchiveRecovery(),
but both functions were still called from the same place. Now postpone
those calls until after we clear InRecovery and shut down the XLogReader.

To find out afterward whether recovery was performed, which is needed
to decide whether to call PerformRecoveryXLogAction(), we look at
ControlFile's DBState.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 40 +++++++++++++++++--------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 397f7d486a6..93849d8f29a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7958,24 +7958,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	if (InRecovery)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -8027,6 +8009,28 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	LocalSetXLogInsertAllowed();
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has
+	 * been through archive or crash recovery.
+	 */
+	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY ||
+		ControlFile->state == DB_IN_CRASH_RECOVERY)
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

#172Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#171)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:

To find the value of InRecovery after we clear it, patch still uses
ControlFile's DBState, but now the check condition changed to a more
specific one which is less confusing.

In casual off-list discussion, the point was made to check
SharedRecoveryState to find out the InRecovery value afterward, and
check that using RecoveryInProgress(). But we can't depend on
SharedRecoveryState because at the start it gets initialized to
RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
Therefore, we can't use RecoveryInProgress() which always returns
true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej

It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.
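
For illustration, here is a minimal sketch of how that test could be
wrapped up, assuming it runs in a backend that can see XLogCtl and that
lastReplayedEndRecPtr is read under XLogCtl->info_lck like the other
shared replay pointers (the helper name is hypothetical, not something
in the patches):

/*
 * Hypothetical helper: did this server perform any recovery?
 * lastReplayedEndRecPtr starts out invalid and is only ever advanced
 * inside the main redo loop, so a valid value means redo ran at least
 * once.
 */
static bool
PerformedRecovery(void)
{
	XLogRecPtr	lastReplayed;

	SpinLockAcquire(&XLogCtl->info_lck);
	lastReplayed = XLogCtl->lastReplayedEndRecPtr;
	SpinLockRelease(&XLogCtl->info_lck);

	return !XLogRecPtrIsInvalid(lastReplayed);
}

The call sites could then use something like PerformedRecovery()
instead of testing InRecovery or ControlFile->state.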

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.
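
As a sketch of what I mean (illustrative only; the helper and its use
of info_lck are assumptions, not something in the patches), the LSN and
the TLI could be read together from the same pair of shared-memory
fields under the same lock, so they are guaranteed to describe the same
record:

/*
 * Illustrative sketch: fetch the replay end LSN and its timeline as a
 * consistent pair.
 */
static void
GetReplayEndPosition(XLogRecPtr *endptr, TimeLineID *endtli)
{
	SpinLockAcquire(&XLogCtl->info_lck);
	*endptr = XLogCtl->replayEndRecPtr;
	*endtli = XLogCtl->replayEndTLI;
	SpinLockRelease(&XLogCtl->info_lck);
}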

--
Robert Haas
EDB: http://www.enterprisedb.com

#173Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#172)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Oct 1, 2021 at 2:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:

To find the value of InRecovery after we clear it, patch still uses
ControlFile's DBState, but now the check condition changed to a more
specific one which is less confusing.

In casual off-list discussion, the point was made to check
SharedRecoveryState to find out the InRecovery value afterward, and
check that using RecoveryInProgress(). But we can't depend on
SharedRecoveryState because at the start it gets initialized to
RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
Therefore, we can't use RecoveryInProgress() which always returns
true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej

I tried to apply the patches on the master branch head and they failed
with conflicts.

I later applied them on the commit below, and they applied cleanly:

commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Sep 27 18:48:01 2021 -0400

Re-enable contrib/bloom's TAP tests.

rushabh@rushabh:postgresql$ git apply
v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
rushabh@rushabh:postgresql$ git apply
v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
rushabh@rushabh:postgresql$ git apply
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space
before tab in indent.
/*
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space
before tab in indent.
*/
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space
before tab in indent.
Insert->fullPageWrites = lastFullPageWrites;
warning: 3 lines add whitespace errors.
rushabh@rushabh:postgresql$ git apply
v36-0004-Remove-dependencies-on-startup-process-specifica.patch

There are whitespace errors on patch 0003.

It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.

--
Robert Haas
EDB: http://www.enterprisedb.com

--
Rushabh Lathia

#174Amul Sul
sulamul@gmail.com
In reply to: Rushabh Lathia (#173)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

On Fri, Oct 1, 2021 at 2:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:

To find the value of InRecovery after we clear it, patch still uses
ControlFile's DBState, but now the check condition changed to a more
specific one which is less confusing.

In casual off-list discussion, the point was made to check
SharedRecoveryState to find out the InRecovery value afterward, and
check that using RecoveryInProgress(). But we can't depend on
SharedRecoveryState because at the start it gets initialized to
RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
Therefore, we can't use RecoveryInProgress() which always returns
true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes.

Well, I have to replace the InRecovery flag in that patch since we are
moving code past the point where the InRecovery flag gets cleared. If I
don't, then the 0002 patch would be wrong, since InRecovery would always
be false there, and the behaviour wouldn't be the same as it was before
that patch.

Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej

I tried to apply the patches on the master branch head and they failed
with conflicts.

Thanks, Rushabh, for the quick check. I have attached a rebased version
against the latest master head, commit f6b5d05ba9a.

I later applied them on the commit below, and they applied cleanly:

commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Sep 27 18:48:01 2021 -0400

Re-enable contrib/bloom's TAP tests.

rushabh@rushabh:postgresql$ git apply v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
rushabh@rushabh:postgresql$ git apply v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
rushabh@rushabh:postgresql$ git apply v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space before tab in indent.
/*
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space before tab in indent.
*/
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space before tab in indent.
Insert->fullPageWrites = lastFullPageWrites;
warning: 3 lines add whitespace errors.
rushabh@rushabh:postgresql$ git apply v36-0004-Remove-dependencies-on-startup-process-specifica.patch

There are whitespace errors on patch 0003.

Fixed.

It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.

Understood. I have used lastReplayedEndRecPtr, but in the 0002 patch,
for the reason given above.

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.

Agreed, that would be the right thing to do, but on the latest master
head replayEndRecPtr might not be the right value to use, because
commit ff9f111bce24 introduced the following code, which can set
EndOfLog to something different from replayEndRecPtr:

/*
* Actually, if WAL ended in an incomplete record, skip the parts that
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

With that commit, we got two new global variables. The first,
missingContrecPtr, becomes the EndOfLog, which already gets stored in
shared memory in a few places; the other, abortedRecPtr, is needed in
XLogAcceptWrites(), so I have exported it into shared memory.

Regards,
Amul

Attachments:

v37-0004-Remove-dependencies-on-startup-process-specifica.patchapplication/x-patch; name=v37-0004-Remove-dependencies-on-startup-process-specifica.patchDownload
From de79f7f46d101768269afa360f7183302eee9551 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v37 4/4] Remove dependencies on startup-process specifical
 variables.

To make XLogAcceptWrites() callable outside the startup process, we
need to remove its dependency on a few global and local variables that
are specific to the startup process.

The global variables are abortedRecPtr, ArchiveRecoveryRequested and
LocalPromoteIsTriggered. LocalPromoteIsTriggered can already be
accessed from any other process through the existing
PromoteIsTriggered(); ArchiveRecoveryRequested and abortedRecPtr are
made accessible by copying them into shared memory.

XLogAcceptWrites() accepts two arguments, EndOfLogTLI and EndOfLog,
which are local to StartupXLOG(). Instead of passing them as arguments,
XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from shared memory
can be used as replacements for EndOfLogTLI and EndOfLog respectively.
XLogCtl->lastSegSwitchLSN is not going to change until we use it: it
changes only when the current WAL segment fills up, which cannot happen
here for two reasons. First, WAL writes are disabled for other
processes until XLogAcceptWrites() finishes; second, before
lastSegSwitchLSN is used, XLogAcceptWrites() writes only fixed-size WAL
records (a full-page-writes change record and a record for either
recovery end or a checkpoint), which will not fill up the 16MB WAL
segment.

EndOfLogTLI in StartupXLOG() is the timeline ID of the last record that
xlogreader read, but that xlogreader was simply re-fetching the last
record we had already replayed in the redo loop if we were in recovery.
If we were not in recovery, there is nothing to worry about, since this
value is needed only when ArchiveRecoveryRequested = true, which
implies redo and therefore sets XLogCtl->replayEndTLI.
---
 src/backend/access/transam/xlog.c | 63 +++++++++++++++++++++++--------
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5abb7c5e542..2dd81af8ca9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -668,6 +668,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag, which is otherwise valid only in the
+	 * startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -717,6 +724,13 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr so that another process can
+	 * write the OVERWRITE_CONTRECORD message if WAL writes are not permitted
+	 * in the process that set it.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -889,8 +903,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -939,7 +952,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -5267,6 +5280,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
@@ -5548,6 +5562,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5739,8 +5758,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+CleanupAfterArchiveRecovery(void)
 {
+	XLogRecPtr	EndOfLog;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5757,6 +5778,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
 	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
 
 	/*
@@ -5791,6 +5813,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	{
 		char		origfname[MAXFNAMELEN];
 		XLogSegNo	endLogSegNo;
+		TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
 		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
@@ -7965,6 +7988,18 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory state. This
+		 * process might be unable to write an OVERWRITE_CONTRECORD message
+		 * because of the WAL write restriction; storing it in shared memory
+		 * lets another process write it later once WAL writes are enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+
+		/* The shared-memory copy will be used from here on */
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8071,7 +8106,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8131,19 +8166,17 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
 
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(XLogCtl->SharedAbortedRecPtr))
 	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
+		CreateOverwriteContrecordRecord(XLogCtl->SharedAbortedRecPtr);
+		XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 	}
 
 	/* Write an XLOG_FPW_CHANGE record */
@@ -8161,8 +8194,8 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 		promoted = PerformRecoveryXLogAction();
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+	if (XLogCtl->SharedArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -8304,8 +8337,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v37-0002-Postpone-some-end-of-recovery-operations-relatin.patchapplication/x-patch; name=v37-0002-Postpone-some-end-of-recovery-operations-relatin.patchDownload
From 3208b3379eb21f97157022d524c1df2d75ab5230 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH v37 2/4] Postpone some end-of-recovery operations relating to
 allowing WAL.

A previous commit moved the code that decides whether to write a
checkpoint or an end-of-recovery record into PerformRecoveryXLogAction(),
and the post-archive-recovery code into CleanupAfterArchiveRecovery(),
but both functions were still called from the same place. Now postpone
those calls until after we clear InRecovery and shut down the XLogReader.

To find out afterward whether recovery was performed, we look at
XLogCtl->lastReplayedEndRecPtr, which only gets set inside the REDO
loop.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 62 +++++++++++++++++--------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7c258465780..cc08d8a475c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8018,34 +8018,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	if (InRecovery)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -8090,6 +8062,40 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has
+	 * been through archive or crash recovery.
+	 *
+	 * If recovery was performed, lastReplayedEndRecPtr will always be a
+	 * valid record pointer, which is never reset after the REDO loop.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

v37-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patchapplication/x-patch; name=v37-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patchDownload
From 01731a5b955535b619f9fec887d25049e3137174 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v37 1/4] Refactor some end-of-recovery code out of
 StartupXLOG().

Move the code that decides whether to write a checkpoint or an
end-of-recovery record into a new function, PerformRecoveryXLogAction().

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 261 ++++++++++++++++--------------
 1 file changed, 143 insertions(+), 118 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eddb13d13a7..7c258465780 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -889,6 +889,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -937,6 +939,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5731,6 +5734,88 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be pre-allocated
+	 * files containing garbage. In any case, they are not part of the new
+	 * timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because it
+	 * was hit by a meteor), it will never make it to the archive. That's OK
+	 * from our point of view, because the new segment that we created with the
+	 * new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically present
+	 * in the new file with new TLI, but recovery won't look there when it's
+	 * recovering to the older timeline. On the other hand, if we archive the
+	 * partial segment, and the original server on that timeline is still
+	 * running and archives the completed version of the same segment later, it
+	 * will fail. (We used to do that in 9.4 and below, and it caused such
+	 * problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix, and
+	 * archive it. Archive recovery will never try to read .partial segments, so
+	 * they will normally go unused. But in the odd PITR case, the administrator
+	 * can copy them manually to the pg_wal directory (removing the suffix).
+	 * They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let it
+	 * be archived normally. (In particular, if it was restored from the archive
+	 * to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7953,127 +8038,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+		promoted = PerformRecoveryXLogAction();
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8282,6 +8253,60 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static bool
+PerformRecoveryXLogAction(void)
+{
+	bool		promoted = false;
+
+	/*
+	 * Perform a checkpoint to update all our recovery activity to disk.
+	 *
+	 * Note that we write a shutdown checkpoint rather than an on-line one. This
+	 * is not particularly critical, but since we may be assigning a new TLI,
+	 * using a shutdown checkpoint allows us to have the rule that TLI only
+	 * changes in shutdown checkpoints, which allows some extra error checking
+	 * in xlog_redo.
+	 *
+	 * In promotion, only create a lightweight end-of-recovery record instead of
+	 * a full checkpoint. A checkpoint is requested later, after we're fully out
+	 * of recovery mode and already accepting queries.
+	 */
+	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		LocalPromoteIsTriggered)
+	{
+		promoted = true;
+
+		/*
+		 * Insert a special WAL record to mark the end of recovery, since we
+		 * aren't doing a checkpoint. That means that the checkpointer process
+		 * may likely be in the middle of a time-smoothed restartpoint and could
+		 * continue to be for minutes after this.  That sounds strange, but the
+		 * effect is roughly the same and it would be stranger to try to come
+		 * out of the restartpoint and then checkpoint. We request a checkpoint
+		 * later anyway, just for safety.
+		 */
+		CreateEndOfRecoveryRecord();
+	}
+	else
+	{
+		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+						  CHECKPOINT_IMMEDIATE |
+						  CHECKPOINT_WAIT);
+	}
+
+	return promoted;
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

v37-0003-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 5d65798aef2aa6cae6b933b9d805aadf71a71b49 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v37 3/4] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 101 +++++++++++++++++-------------
 1 file changed, 59 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc08d8a475c..5abb7c5e542 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -939,6 +939,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -8062,52 +8063,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory and, later, once WAL writes
+	 * are permitted, write an XLOG_FPW_CHANGE record before the resource
+	 * manager writes cleanup WAL records or a checkpoint record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
 
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has been
-	 * through the archive or the crash recovery.
-	 *
-	 * If the recovery is performed lastReplayedEndRecPtr will always be a valid
-	 * record pointer that never changes after REDO loop.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8163,6 +8127,59 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	bool		promoted = false;
+
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/* Write an XLOG_FPW_CHANGE record */
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has been
+	 * through the archive or the crash recovery.
+	 *
+	 * If the recovery is performed lastReplayedEndRecPtr will always be a valid
+	 * record pointer that never changes after REDO loop.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

#175Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Amul Sul (#174)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:

On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

I tried to apply the patch on the master branch head and it's failing
with conflicts.

Thanks, Rushabh, for the quick check, I have attached a rebased version for the
latest master head commit # f6b5d05ba9a.

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/3 subtests

Test Summary Report
-------------------
t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#176Amul Sul
sulamul@gmail.com
In reply to: Jaime Casanova (#175)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:

On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:

On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

I tried to apply the patch on the master branch head and it's failing
with conflicts.

Thanks, Rushabh, for the quick check, I have attached a rebased version for the
latest master head commit # f6b5d05ba9a.

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/3 subtests

Test Summary Report
-------------------
t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""

Thanks for reporting the problem; I am working on it. The cause of the
failure is that the v37-0004 patch clears the missingContrecPtr global
variable before CreateOverwriteContrecordRecord() runs, which it
shouldn't.

Regards,
Amul

#177Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#176)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Oct 7, 2021 at 6:21 PM Amul Sul <sulamul@gmail.com> wrote:

On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:

On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:

On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

I tried to apply the patch on the master branch head and it's failing
with conflicts.

Thanks, Rushabh, for the quick check, I have attached a rebased version for the
latest master head commit # f6b5d05ba9a.

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/3 subtests

Test Summary Report
-------------------
t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""

Thanks for reporting the problem; I am working on it. The cause of the
failure is that the v37-0004 patch clears the missingContrecPtr global
variable before CreateOverwriteContrecordRecord() runs, which it
shouldn't.

In the attached version I have fixed this issue by restoring missingContrecPtr.

To handle abortedRecPtr and missingContrecPtr, the global variables
newly added by commit # ff9f111bce24, we don't need to store them in
shared memory separately; instead, we need a flag indicating that a
broken record was found earlier, so that at the end of recovery we can
write the overwrite-contrecord.

The missingContrecPtr is assigned EndOfLog, which we already handled
in the 0004 patch, and the abortedRecPtr is the same as
lastReplayedEndRecPtr, AFAICS. I have added an assert to ensure that
the lastReplayedEndRecPtr value is the same as abortedRecPtr, but I
think that is not strictly needed; we can go ahead and write an
overwrite-contrecord starting at lastReplayedEndRecPtr.

Regards,
Amul

Attachments:

v38-0004-Remove-dependencies-on-startup-process-specifica.patch (application/x-patch)
From 5bf021226d9742a6fefbcb33e54f7ef044d8fbcc Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v38 4/4] Remove dependencies on startup-process-specific
 variables.

To make XLogAcceptWrites() independent of the startup process, we need
to remove its dependency on a few global and local variables that are
specific to the startup process.

The global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be read from any other process via
the existing PromoteIsTriggered().  ArchiveRecoveryRequested is made
accessible by copying it into shared memory.  abortedRecPtr and
missingContrecPtr can be derived from existing shared memory values,
but for that we need a flag indicating that a broken record was found
previously and that an OVERWRITE_CONTRECORD message needs to be
written once WAL writes are permitted; such a flag is added here.

XLogAcceptWrites() takes two arguments, EndOfLogTLI and EndOfLog,
which are local to StartupXLOG().  Instead of passing them as
arguments, XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from
shared memory can be used as replacements for EndOfLogTLI and
EndOfLog, respectively.  XLogCtl->lastSegSwitchLSN will not change
before we use it: it changes only when the current WAL segment fills
up, which cannot happen here for two reasons.  First, WAL writes are
disabled for other processes until XLogAcceptWrites() finishes;
second, before lastSegSwitchLSN is used, XLogAcceptWrites() writes
only fixed-size WAL records (a full-page-writes record and either an
end-of-recovery or a checkpoint record), which will not fill up the
16MB WAL segment.

EndOfLogTLI in StartupXLOG() is the timeline ID of the last record
read by the xlogreader, but that xlogreader was simply re-fetching the
last record we had already replayed in the redo loop, if we were in
recovery.  If we were not in recovery, there is nothing to worry
about, since this value is needed only when ArchiveRecoveryRequested =
true, which implicitly forces redo and sets XLogCtl->replayEndTLI.
---
 src/backend/access/transam/xlog.c | 90 ++++++++++++++++++++++++-------
 1 file changed, 72 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cdfec248081..b9596ca005c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -668,6 +668,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag, which is otherwise valid only in the
+	 * startup process, so that other processes can see it.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -706,9 +713,10 @@ typedef struct XLogCtlData
 
 	/*
 	 * lastReplayedEndRecPtr points to end+1 of the last record successfully
-	 * replayed. When we're currently replaying a record, ie. in a redo
-	 * function, replayEndRecPtr points to the end+1 of the record being
-	 * replayed, otherwise it's equal to lastReplayedEndRecPtr.
+	 * replayed, and possibly where a broken record starts (if one exists).
+	 * When we're currently replaying a record, ie. in a redo function,
+	 * replayEndRecPtr points to the end+1 of the record being replayed,
+	 * otherwise it's equal to lastReplayedEndRecPtr.
 	 */
 	XLogRecPtr	lastReplayedEndRecPtr;
 	TimeLineID	lastReplayedTLI;
@@ -717,6 +725,12 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * overwriteContrecord indicates that a broken record was found at the end
+	 * of recovery and an OVERWRITE_CONTRECORD message needs to be written.
+	 */
+	bool		overwriteContrecord;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -889,8 +903,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -939,7 +952,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -5267,6 +5280,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
@@ -5548,6 +5562,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5739,8 +5758,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+CleanupAfterArchiveRecovery(void)
 {
+	XLogRecPtr	EndOfLog;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5757,6 +5778,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
 	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
 
 	/*
@@ -5791,6 +5813,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	{
 		char		origfname[MAXFNAMELEN];
 		XLogSegNo	endLogSegNo;
+		TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
 		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
@@ -7965,6 +7988,27 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Set the broken-record-found flag in shared memory. This process
+		 * might be unable to write an OVERWRITE_CONTRECORD message because of
+		 * the WAL write restriction.  Storing the flag in shared memory lets
+		 * another process write it later, once WAL writes are enabled.
+		 */
+		XLogCtl->overwriteContrecord = true;
+
+		/*
+		 * While writing the OVERWRITE_CONTRECORD message, the abortedRecPtr
+		 * and missingContrecPtr values need to be restored, and they can be
+		 * fetched from shared memory: lastReplayedEndRecPtr is the
+		 * abortedRecPtr, and missingContrecPtr is the EndOfLog, which is
+		 * stored in several places in shared memory (e.g. lastSegSwitchLSN,
+		 * which is not going to change before the point where the
+		 * OVERWRITE_CONTRECORD message gets written).
+		 */
+		Assert(abortedRecPtr == XLogCtl->lastReplayedEndRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8071,7 +8115,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8131,19 +8175,29 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
 
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (XLogCtl->overwriteContrecord)
 	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
+		/*
+		 * Restore missingContrecPtr, which is needed to set the
+		 * XLP_FIRST_IS_OVERWRITE_CONTRECORD flag on the page header where the
+		 * overwrite-contrecord gets written. See AdvanceXLInsertBuffer().
+		 */
+		GetLastSegSwitchData(&missingContrecPtr);
+
+		/*
+		 * Start writing overwrite-contrecord after the point where the last
+		 * valid replayed record ended.
+		 */
+		CreateOverwriteContrecordRecord(XLogCtl->lastReplayedEndRecPtr);
+
+		XLogCtl->overwriteContrecord = false;
 	}
 
 	/* Write an XLOG_FPW_CHANGE record */
@@ -8161,8 +8215,8 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 		promoted = PerformRecoveryXLogAction();
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+	if (XLogCtl->SharedArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
@@ -8304,8 +8358,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v38-0003-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 733035e06d5dedd2142a9f126332c056c8a4d42d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v38 3/4] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 101 +++++++++++++++++-------------
 1 file changed, 59 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6612b81e4b9..cdfec248081 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -939,6 +939,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -8062,52 +8063,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
 	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * Update full_page_writes in shared memory and, later, once WAL writes
+	 * are permitted, write an XLOG_FPW_CHANGE record before the resource
+	 * manager writes cleanup WAL records or a checkpoint record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
 
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has been
-	 * through the archive or the crash recovery.
-	 *
-	 * If the recovery is performed lastReplayedEndRecPtr will always be a valid
-	 * record pointer that never changes after REDO loop.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8163,6 +8127,59 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	bool		promoted = false;
+
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/* Write an XLOG_FPW_CHANGE record */
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has been
+	 * through the archive or the crash recovery.
+	 *
+	 * If the recovery is performed lastReplayedEndRecPtr will always be a valid
+	 * record pointer that never changes after REDO loop.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

v38-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch (application/x-patch)
From 19ac27a62187753eaef168785b6222bb9497de26 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 13:07:56 -0400
Subject: [PATCH v38 1/4] Refactor some end-of-recovery code out of
 StartupXLOG().

Move the code that decides whether to write a checkpoint or an
end-of-recovery record into PerformRecoveryXLogAction().

Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
This is straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 261 ++++++++++++++++--------------
 1 file changed, 143 insertions(+), 118 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26dcc00ac01..44e5a0610ef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -889,6 +889,8 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
+static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
+										XLogRecPtr EndOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -937,6 +939,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
 static bool rescanLatestTimeLine(void);
@@ -5731,6 +5734,88 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+/*
+ * Perform cleanup actions at the conclusion of archive recovery.
+ */
+static void
+CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	/*
+	 * Execute the recovery_end_command, if any.
+	 */
+	if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
+		ExecuteRecoveryCommand(recoveryEndCommand,
+							   "recovery_end_command",
+							   true);
+
+	/*
+	 * We switched to a new timeline. Clean up segments on the old timeline.
+	 *
+	 * If there are any higher-numbered segments on the old timeline, remove
+	 * them. They might contain valid WAL, but they might also be pre-allocated
+	 * files containing garbage. In any case, they are not part of the new
+	 * timeline's history so we don't need them.
+	 */
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+
+	/*
+	 * If the switch happened in the middle of a segment, what to do with the
+	 * last, partial segment on the old timeline? If we don't archive it, and
+	 * the server that created the WAL never archives it either (e.g. because it
+	 * was hit by a meteor), it will never make it to the archive. That's OK
+	 * from our point of view, because the new segment that we created with the
+	 * new TLI contains all the WAL from the old timeline up to the switch
+	 * point. But if you later try to do PITR to the "missing" WAL on the old
+	 * timeline, recovery won't find it in the archive. It's physically present
+	 * in the new file with new TLI, but recovery won't look there when it's
+	 * recovering to the older timeline. On the other hand, if we archive the
+	 * partial segment, and the original server on that timeline is still
+	 * running and archives the completed version of the same segment later, it
+	 * will fail. (We used to do that in 9.4 and below, and it caused such
+	 * problems).
+	 *
+	 * As a compromise, we rename the last segment with the .partial suffix, and
+	 * archive it. Archive recovery will never try to read .partial segments, so
+	 * they will normally go unused. But in the odd PITR case, the administrator
+	 * can copy them manually to the pg_wal directory (removing the suffix).
+	 * They can be useful in debugging, too.
+	 *
+	 * If a .done or .ready file already exists for the old timeline, however,
+	 * we had already determined that the segment is complete, so we can let it
+	 * be archived normally. (In particular, if it was restored from the archive
+	 * to begin with, it's expected to have a .done file).
+	 */
+	if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
+		XLogArchivingActive())
+	{
+		char		origfname[MAXFNAMELEN];
+		XLogSegNo	endLogSegNo;
+
+		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
+
+		if (!XLogArchiveIsReadyOrDone(origfname))
+		{
+			char		origpath[MAXPGPATH];
+			char		partialfname[MAXFNAMELEN];
+			char		partialpath[MAXPGPATH];
+
+			XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
+			snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
+
+			/*
+			 * Make sure there's no .done or .ready file for the .partial
+			 * file.
+			 */
+			XLogArchiveCleanup(partialfname);
+
+			durable_rename(origpath, partialpath, ERROR);
+			XLogArchiveNotify(partialfname);
+		}
+	}
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7953,127 +8038,13 @@ StartupXLOG(void)
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
+	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
 	if (InRecovery)
-	{
-		/*
-		 * Perform a checkpoint to update all our recovery activity to disk.
-		 *
-		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
-		 *
-		 * In promotion, only create a lightweight end-of-recovery record
-		 * instead of a full checkpoint. A checkpoint is requested later,
-		 * after we're fully out of recovery mode and already accepting
-		 * queries.
-		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-			LocalPromoteIsTriggered)
-		{
-			promoted = true;
-
-			/*
-			 * Insert a special WAL record to mark the end of recovery, since
-			 * we aren't doing a checkpoint. That means that the checkpointer
-			 * process may likely be in the middle of a time-smoothed
-			 * restartpoint and could continue to be for minutes after this.
-			 * That sounds strange, but the effect is roughly the same and it
-			 * would be stranger to try to come out of the restartpoint and
-			 * then checkpoint. We request a checkpoint later anyway, just for
-			 * safety.
-			 */
-			CreateEndOfRecoveryRecord();
-		}
-		else
-		{
-			RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-							  CHECKPOINT_IMMEDIATE |
-							  CHECKPOINT_WAIT);
-		}
-	}
+		promoted = PerformRecoveryXLogAction();
 
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-	{
-		/*
-		 * And finally, execute the recovery_end_command, if any.
-		 */
-		if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
-			ExecuteRecoveryCommand(recoveryEndCommand,
-								   "recovery_end_command",
-								   true);
-
-		/*
-		 * We switched to a new timeline. Clean up segments on the old
-		 * timeline.
-		 *
-		 * If there are any higher-numbered segments on the old timeline,
-		 * remove them. They might contain valid WAL, but they might also be
-		 * pre-allocated files containing garbage. In any case, they are not
-		 * part of the new timeline's history so we don't need them.
-		 */
-		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
-
-		/*
-		 * If the switch happened in the middle of a segment, what to do with
-		 * the last, partial segment on the old timeline? If we don't archive
-		 * it, and the server that created the WAL never archives it either
-		 * (e.g. because it was hit by a meteor), it will never make it to the
-		 * archive. That's OK from our point of view, because the new segment
-		 * that we created with the new TLI contains all the WAL from the old
-		 * timeline up to the switch point. But if you later try to do PITR to
-		 * the "missing" WAL on the old timeline, recovery won't find it in
-		 * the archive. It's physically present in the new file with new TLI,
-		 * but recovery won't look there when it's recovering to the older
-		 * timeline. On the other hand, if we archive the partial segment, and
-		 * the original server on that timeline is still running and archives
-		 * the completed version of the same segment later, it will fail. (We
-		 * used to do that in 9.4 and below, and it caused such problems).
-		 *
-		 * As a compromise, we rename the last segment with the .partial
-		 * suffix, and archive it. Archive recovery will never try to read
-		 * .partial segments, so they will normally go unused. But in the odd
-		 * PITR case, the administrator can copy them manually to the pg_wal
-		 * directory (removing the suffix). They can be useful in debugging,
-		 * too.
-		 *
-		 * If a .done or .ready file already exists for the old timeline,
-		 * however, we had already determined that the segment is complete, so
-		 * we can let it be archived normally. (In particular, if it was
-		 * restored from the archive to begin with, it's expected to have a
-		 * .done file).
-		 */
-		if (XLogSegmentOffset(EndOfLog, wal_segment_size) != 0 &&
-			XLogArchivingActive())
-		{
-			char		origfname[MAXFNAMELEN];
-			XLogSegNo	endLogSegNo;
-
-			XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
-			XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
-
-			if (!XLogArchiveIsReadyOrDone(origfname))
-			{
-				char		origpath[MAXPGPATH];
-				char		partialfname[MAXFNAMELEN];
-				char		partialpath[MAXPGPATH];
-
-				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
-				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
-				snprintf(partialpath, MAXPGPATH, "%s.partial", origpath);
-
-				/*
-				 * Make sure there's no .done or .ready file for the .partial
-				 * file.
-				 */
-				XLogArchiveCleanup(partialfname);
-
-				durable_rename(origpath, partialpath, ERROR);
-				XLogArchiveNotify(partialfname);
-			}
-		}
-	}
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Preallocate additional log files, if wanted.
@@ -8282,6 +8253,60 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Perform whatever XLOG actions are necessary at end of REDO.
+ *
+ * The goal here is to make sure that we'll be able to recover properly if
+ * we crash again. If we choose to write a checkpoint, we'll write a shutdown
+ * checkpoint rather than an on-line one. This is not particularly critical,
+ * but since we may be assigning a new TLI, using a shutdown checkpoint allows
+ * us to have the rule that TLI only changes in shutdown checkpoints, which
+ * allows some extra error checking in xlog_redo.
+ */
+static bool
+PerformRecoveryXLogAction(void)
+{
+	bool		promoted = false;
+
+	/*
+	 * Perform a checkpoint to update all our recovery activity to disk.
+	 *
+	 * Note that we write a shutdown checkpoint rather than an on-line one. This
+	 * is not particularly critical, but since we may be assigning a new TLI,
+	 * using a shutdown checkpoint allows us to have the rule that TLI only
+	 * changes in shutdown checkpoints, which allows some extra error checking
+	 * in xlog_redo.
+	 *
+	 * In promotion, only create a lightweight end-of-recovery record instead of
+	 * a full checkpoint. A checkpoint is requested later, after we're fully out
+	 * of recovery mode and already accepting queries.
+	 */
+	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+		LocalPromoteIsTriggered)
+	{
+		promoted = true;
+
+		/*
+		 * Insert a special WAL record to mark the end of recovery, since we
+		 * aren't doing a checkpoint. That means that the checkpointer process
+		 * may likely be in the middle of a time-smoothed restartpoint and could
+		 * continue to be for minutes after this.  That sounds strange, but the
+		 * effect is roughly the same and it would be stranger to try to come
+		 * out of the restartpoint and then checkpoint. We request a checkpoint
+		 * later anyway, just for safety.
+		 */
+		CreateEndOfRecoveryRecord();
+	}
+	else
+	{
+		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
+						  CHECKPOINT_IMMEDIATE |
+						  CHECKPOINT_WAIT);
+	}
+
+	return promoted;
+}
+
 /*
  * Is the system still in recovery?
  *
-- 
2.18.0

v38-0002-Postpone-some-end-of-recovery-operations-relatin.patch (application/x-patch)
From b0d1790e5aa95a02217efb4e635398d3086a7493 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 23 Jul 2021 14:27:51 -0400
Subject: [PATCH v38 2/4] Postpone some end-of-recovery operations relating to
 allowing WAL.

Previously, the code that decides whether to write a checkpoint or an
end-of-recovery record was moved into PerformRecoveryXLogAction(), and
the post-archive-recovery code into CleanupAfterArchiveRecovery(), but
both functions were still called from the same place. Now postpone
that work until after we clear InRecovery and shut down the XLogReader.

We can still determine what InRecovery would have told us by looking
at XLogCtl->lastReplayedEndRecPtr, which only gets set inside the REDO
loop.

This is preparatory work for a future patch that wants to allow
recovery to end at one time and only later start to allow WAL writes.
The steps that themselves write WAL clearly shouldn't happen before
we're ready to accept WAL writes, and it seems best for now to keep
the steps performed by CleanupAfterArchiveRecovery() at the same point
relative to the surrounding steps. We assume (hopefully correctly)
that the user doesn't want recovery_end_command to run until we're
committed to writing WAL on the new timeline. Until then, the
machine is still usable as a standby on the old timeline.

Aside from the value of this patch as preparatory work, this order of
operations actually seems more logical, since it means we don't
actually write any WAL until after exiting recovery.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 62 +++++++++++++++++--------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 44e5a0610ef..6612b81e4b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8018,34 +8018,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
-
-	/* Emit checkpoint or end-of-recovery record in XLOG, if required. */
-	if (InRecovery)
-		promoted = PerformRecoveryXLogAction();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -8090,6 +8062,40 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+	LocalXLogInsertAllowed = -1;
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if the server has been
+	 * through the archive or the crash recovery.
+	 *
+	 * If the recovery is performed lastReplayedEndRecPtr will always be a valid
+	 * record pointer that never changes after REDO loop.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
-- 
2.18.0

#178Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#177)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul@gmail.com> wrote:

In the attached version I have fixed this issue by restoring missingContrecPtr.

To handle abortedRecPtr and missingContrecPtr, the global variables
newly added by commit # ff9f111bce24, we don't need to store them in
shared memory separately; instead, we need a flag indicating that a
broken record was found earlier, so that at the end of recovery we can
write the overwrite-contrecord.

The missingContrecPtr is assigned EndOfLog, which we already handled
in the 0004 patch, and the abortedRecPtr is the same as
lastReplayedEndRecPtr, AFAICS. I have added an assert to ensure that
the lastReplayedEndRecPtr value is the same as abortedRecPtr, but I
think that is not strictly needed; we can go ahead and write an
overwrite-contrecord starting at lastReplayedEndRecPtr.

I thought that it made sense to commit 0001 and 0002 at this point, so
I have done that. I think that the treatment of missingContrecPtr and
abortedRecPtr may need more thought yet, so at least for that reason I
don't think it's a good idea to proceed with 0004 yet. 0003 is just
code movement so I guess that can be committed whenever we're
confident that we know exactly which things we want to end up inside
XLogAcceptWrites().

I do have a few ideas after studying this a bit more:

- I wonder whether, in addition to moving a few things later as 0002
did, we also ought to think about moving one thing earlier,
specifically XLogReportParameters(). Right now, we have, I believe,
four things that write WAL at the end of recovery:
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
is structured now, we do the first three of those things, and then do
a bunch of other stuff inside CleanupAfterArchiveRecovery() like
running recovery_end_command, and removing non-parent xlog files, and
archiving the partial segment, and then come back and do the fourth
one. Is there any good reason for that? If not, I think doing them all
together would be cleaner, and would propose to reverse the order of
CleanupAfterArchiveRecovery() and XLogReportParameters().
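
To make the grouping concrete, here is a throwaway sketch of that ordering,
using empty stand-ins with simplified signatures; the names mirror the real
routines, but this is not PostgreSQL code, and XLogAcceptWritesSketch() and
its boolean parameters are purely illustrative:

#include <stdbool.h>

/* Empty stand-ins with simplified signatures; not PostgreSQL code. */
static void CreateOverwriteContrecordRecord(void) {}
static void UpdateFullPageWrites(void) {}
static bool PerformRecoveryXLogAction(void) { return true; }
static void XLogReportParameters(void) {}
static void CleanupAfterArchiveRecovery(void) {}

/*
 * Proposed ordering: do all of the WAL-writing steps back to back, and only
 * then run the post-archive-recovery cleanup (recovery_end_command, removal
 * of non-parent segments, archiving of the .partial segment).
 */
static bool
XLogAcceptWritesSketch(bool have_broken_contrecord,
					   bool did_recovery,
					   bool archive_recovery_requested)
{
	bool		promoted = false;

	if (have_broken_contrecord)
		CreateOverwriteContrecordRecord();
	UpdateFullPageWrites();
	if (did_recovery)
		promoted = PerformRecoveryXLogAction();
	XLogReportParameters();		/* moved before the cleanup step */

	if (archive_recovery_requested)
		CleanupAfterArchiveRecovery();

	return promoted;
}

int
main(void)
{
	(void) XLogAcceptWritesSketch(false, true, true);
	return 0;
}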

- If we did that, then I would further propose to adjust things so
that we remove the call to LocalSetXLogInsertAllowed() and the
assignment LocalXLogInsertAllowed = -1 from inside
CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
after UpdateFullPageWrites(), and the call to
LocalSetXLogInsertAllowed() just before XLogReportParameters().
Instead, just let the call to LocalSetXLogInsertAllowed() right before
CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
to be much point in flipping that switch off and on again, and the
fact that we have been doing so is in my view just evidence that
StartupXLOG() doesn't do a very good job of getting related code all
into one place.

- It seems really tempting to invent a fourth RecoveryState value that
indicates that we are done with REDO but not yet in production, and
maybe also to rename RecoveryState to something like WALState. I'm
thinking of something like WAL_STATE_CRASH_RECOVERY,
WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
WAL_STATE_PRODUCTION. Then, instead of having
LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
the startup process and the checkpointer are allowed to insert WAL
when the state is WAL_STATE_REDO_COMPLETE, but other processes only
once we reach WAL_STATE_PRODUCTION. We would set
WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
It's necessary to include the checkpointer, or at least I think it is,
because PerformRecoveryXLogAction() might call RequestCheckpoint(),
and that's got to work. If we did this, then I think it would also
solve another problem which the overall patch set has to address
somehow. Say that we eventually move responsibility for the
to-be-created XLogAcceptWrites() function from the startup process to
the checkpointer, as proposed. The checkpointer needs to know when to
call it ... and the answer with this change is simple: when we reach
WAL_STATE_REDO_COMPLETE, it's time!
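
Just to make the permission rule concrete, here is a minimal standalone model
of that state machine; the enum values are the ones proposed above, while the
process-identity booleans and the Sketch-suffixed function are stand-ins
rather than real PostgreSQL code:

#include <stdbool.h>
#include <stdio.h>

typedef enum WALState
{
	WAL_STATE_CRASH_RECOVERY,
	WAL_STATE_ARCHIVE_RECOVERY,
	WAL_STATE_REDO_COMPLETE,
	WAL_STATE_PRODUCTION
} WALState;

/* Stand-ins for shared state and for process identity. */
static WALState SharedWALState = WAL_STATE_CRASH_RECOVERY;
static bool am_startup_process = false;
static bool am_checkpointer = false;

/*
 * Sketch of the check: the startup process and the checkpointer may insert
 * WAL once redo is complete (so the end-of-recovery record or the requested
 * checkpoint can be written); everyone else must wait for production state.
 */
static bool
XLogInsertAllowedSketch(void)
{
	switch (SharedWALState)
	{
		case WAL_STATE_CRASH_RECOVERY:
		case WAL_STATE_ARCHIVE_RECOVERY:
			return false;
		case WAL_STATE_REDO_COMPLETE:
			return am_startup_process || am_checkpointer;
		case WAL_STATE_PRODUCTION:
			return true;
	}
	return false;
}

int
main(void)
{
	SharedWALState = WAL_STATE_REDO_COMPLETE;

	am_checkpointer = true;
	printf("checkpointer may insert WAL: %d\n", XLogInsertAllowedSketch());

	am_checkpointer = false;	/* an ordinary backend */
	printf("ordinary backend may insert WAL: %d\n", XLogInsertAllowedSketch());
	return 0;
}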

But this idea is not completely problem-free. I spent some time poking
at it and I think it's a little hard to come up with a satisfying way
to code XLogInsertAllowed(). Right now that function calls
RecoveryInProgress(), and if RecoveryInProgress() decides that
recovery is no longer in progress, it calls InitXLOGAccess(). However,
that presumes that the only reason you'd call RecoveryInProgress() is
to figure out whether you should write WAL, which I don't think is
really true, and it also means that, when the wal state is
WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
true in the checkpointer and startup process and false everywhere
else, which does not sound like a great idea. It seems fine to say
that xlog insertion is allowed in some processes but not others,
because not all processes are necessarily equally privileged, but
whether or not we're in recovery is supposed to be something about
which everyone agrees, so answering that question differently in
different processes doesn't seem nice. XLogInsertAllowed() could be
rewritten to check the state directly and make its own determination,
without relying on RecoveryInProgress(), and I think that might be the
right way to go here.
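
A rough sketch of an XLogInsertAllowed() that makes its own
determination from the shared state, assuming a hypothetical
GetWALState() accessor and the usual process-type macros:

    bool
    XLogInsertAllowed(void)
    {
        switch (GetWALState())          /* hypothetical shared-state accessor */
        {
            case WAL_STATE_CRASH_RECOVERY:
            case WAL_STATE_ARCHIVE_RECOVERY:
                return false;           /* still replaying, nobody writes WAL */

            case WAL_STATE_REDO_COMPLETE:
                /* only the startup process and the checkpointer may write WAL */
                return AmStartupProcess() || AmCheckpointerProcess();

            case WAL_STATE_PRODUCTION:
                return true;
        }

        return false;                   /* unreachable, silences the compiler */
    }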

But that isn't entirely problem-free either, because there's a lot of
code that uses RecoveryInProgress() to answer the question "should I
write WAL?" and therefore it's not great if RecoveryInProgress() is
returning an answer that is inconsistent with XLogInsertAllowed().
MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this
kind of coding. It probably wouldn't break in practice right away,
because most of that code never runs in the startup process or the
checkpointer and would therefore never notice the difference in
behavior between those two functions, but if in the future we get the
read-only feature that this thread is supposed to be about, we'd have
problems. Not all RecoveryInProgress() calls have this sense - e.g.
sendDir() in basebackup.c is trying to figure out whether recovery
ended during the backup, not whether we can write WAL. But perhaps
this is a good time to go and replace RecoveryInProgress() checks that
are intending to decide whether or not it's OK to write WAL with
XLogInsertAllowed() checks (noting that the return value is reversed).
If we did that, then I think RecoveryInProgress() could also NOT call
InitXLOGAccess(), and that could be done only by XLogInsertAllowed(),
which seems like it might be better. But I haven't really tried to
code all of this up, so I'm not really sure how it all works out.
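
Concretely, the mechanical replacement being suggested turns the
common "should I write WAL?" pattern into its XLogInsertAllowed()
equivalent, remembering that the sense is reversed (a hypothetical
snippet, not a literal diff of MarkBufferDirtyHint()):

    /* today: */
    if (RecoveryInProgress())
        return;                 /* cannot write WAL, skip the optional work */

    /* proposed: */
    if (!XLogInsertAllowed())
        return;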

--
Robert Haas
EDB: http://www.enterprisedb.com

#179Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#178)
1 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Oct 14, 2021 at 11:10 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul@gmail.com> wrote:

In the attached version I have fixed this issue by restoring missingContrecPtr.

To handle abortedRecPtr and missingContrecPtr, the global variables
newly added by commit ff9f111bce24, we don't need to store them in
shared memory separately; instead, we need a flag indicating that a
broken record was found previously, so that at the end of recovery we
can write the overwrite-contrecord.

The missingContrecPtr is assigned to EndOfLog, and we have already
handled EndOfLog in the 0004 patch, and the abortedRecPtr is the same
as lastReplayedEndRecPtr, AFAICS. I have added an assert to ensure
that the lastReplayedEndRecPtr value is the same as abortedRecPtr, but
I think that is not strictly needed; we can go ahead and write an
overwrite-contrecord starting at lastReplayedEndRecPtr.

I thought that it made sense to commit 0001 and 0002 at this point, so
I have done that. I think that the treatment of missingContrecPtr and
abortedRecPtr may need more thought yet, so at least for that reason I
don't think it's a good idea to proceed with 0004 yet. 0003 is just
code movement so I guess that can be committed whenever we're
confident that we know exactly which things we want to end up inside
XLogAcceptWrites().

Ok.

I do have a few ideas after studying this a bit more:

- I wonder whether, in addition to moving a few things later as 0002
did, we also ought to think about moving one thing earlier,
specifically XLogReportParameters(). Right now, we have, I believe,
four things that write WAL at the end of recovery:
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
is structured now, we do the first three of those things, and then do
a bunch of other stuff inside CleanupAfterArchiveRecovery() like
running recovery_end_command, and removing non-parent xlog files, and
archiving the partial segment, and then come back and do the fourth
one. Is there any good reason for that? If not, I think doing them all
together would be cleaner, and would propose to reverse the order of
CleanupAfterArchiveRecovery() and XLogReportParameters().

Yes, that can be done.

- If we did that, then I would further propose to adjust things so
that we remove the call to LocalSetXLogInsertAllowed() and the
assignment LocalXLogInsertAllowed = -1 from inside
CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
after UpdateFullPageWrites(), and the call to
LocalSetXLogInsertAllowed() just before XLogReportParameters().
Instead, just let the call to LocalSetXLogInsertAllowed() right before
CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
to be much point in flipping that switch off and on again, and the
fact that we have been doing so is in my view just evidence that
StartupXLOG() doesn't do a very good job of getting related code all
into one place.

Currently there are three places that call LocalSetXLogInsertAllowed()
and reset the LocalXLogInsertAllowed flag: StartupXLOG(),
CreateEndOfRecoveryRecord() and CreateCheckPoint(). With the
aforementioned code rearrangement we can get rid of the repeated calls
from StartupXLOG() and can completely remove the need for it in
CreateEndOfRecoveryRecord(), since that gets called only from
StartupXLOG() directly. CreateCheckPoint() is also called from
StartupXLOG(), but only when running in a standalone backend, and in
that case we don't need to call LocalSetXLogInsertAllowed(); if it is
running in the checkpointer process, then we do need it.

I tried this in the attached version, but I'm a bit skeptical about
the changes that are needed for CreateCheckPoint(); those don't seem
clean. I am wondering if we could completely remove the need for the
end-of-recovery checkpoint as proposed in [1]. That would get rid of
the CHECKPOINT_END_OF_RECOVERY operation and the
LocalSetXLogInsertAllowed() requirement in CreateCheckPoint(), and
after that we would not expect any checkpoint operation during
recovery. If we could do that, then we would have
LocalSetXLogInsertAllowed() in only one place, i.e. in StartupXLOG()
(...and in the future in XLogAcceptWrites()) -- code that runs only
once in the lifetime of the server -- and the kludge that the attached
patch does for CreateCheckPoint() would not be needed.

- It seems really tempting to invent a fourth RecoveryState value that
indicates that we are done with REDO but not yet in production, and
maybe also to rename RecoveryState to something like WALState. I'm
thinking of something like WAL_STATE_CRASH_RECOVERY,
WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
WAL_STATE_PRODUCTION. Then, instead of having
LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
the startup process and the checkpointer are allowed to insert WAL
when the state is WAL_STATE_REDO_COMPLETE, but other processes only
once we reach WAL_STATE_PRODUCTION. We would set
WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
It's necessary to include the checkpointer, or at least I think it is,
because PerformRecoveryXLogAction() might call RequestCheckpoint(),
and that's got to work. If we did this, then I think it would also
solve another problem which the overall patch set has to address
somehow. Say that we eventually move responsibility for the
to-be-created XLogAcceptWrites() function from the startup process to
the checkpointer, as proposed. The checkpointer needs to know when to
call it ... and the answer with this change is simple: when we reach
WAL_STATE_REDO_COMPLETE, it's time!

But this idea is not completely problem-free. I spent some time poking
at it and I think it's a little hard to come up with a satisfying way
to code XLogInsertAllowed(). Right now that function calls
RecoveryInProgress(), and if RecoveryInProgress() decides that
recovery is no longer in progress, it calls InitXLOGAccess(). However,
that presumes that the only reason you'd call RecoveryInProgress() is
to figure out whether you should write WAL, which I don't think is
really true, and it also means that, when the wal state is
WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
true in the checkpointer and startup process and false everywhere
else, which does not sound like a great idea. It seems fine to say
that xlog insertion is allowed in some processes but not others,
because not all processes are necessarily equally privileged, but
whether or not we're in recovery is supposed to be something about
which everyone agrees, so answering that question differently in
different processes doesn't seem nice. XLogInsertAllowed() could be
rewritten to check the state directly and make its own determination,
without relying on RecoveryInProgress(), and I think that might be the
right way to go here.

But that isn't entirely problem-free either, because there's a lot of
code that uses RecoveryInProgress() to answer the question "should I
write WAL?" and therefore it's not great if RecoveryInProgress() is
returning an answer that is inconsistent with XLogInsertAllowed().
MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this
kind of coding. It probably wouldn't break in practice right away,
because most of that code never runs in the startup process or the
checkpointer and would therefore never notice the difference in
behavior between those two functions, but if in the future we get the
read-only feature that this thread is supposed to be about, we'd have
problems. Not all RecoveryInProgress() calls have this sense - e.g.
sendDir() in basebackup.c is trying to figure out whether recovery
ended during the backup, not whether we can write WAL. But perhaps
this is a good time to go and replace RecoveryInProgress() checks that
are intending to decide whether or not it's OK to write WAL with
XLogInsertAllowed() checks (noting that the return value is reversed).
If we did that, then I think RecoveryInProgress() could also NOT call
InitXLOGAccess(), and that could be done only by XLogInsertAllowed(),
which seems like it might be better. But I haven't really tried to
code all of this up, so I'm not really sure how it all works out.

I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
good, but I am not sure about calling it from XLogInsertAllowed()
either; both are status-check functions, and the general expectation
might be that status-checking functions do not change and/or
initialize the system state. InitXLOGAccess() should get called from
the very first WAL write operation if needed, but if we don't want to
do that, then I would prefer to call InitXLOGAccess() from
XLogInsertAllowed() instead of RecoveryInProgress().

As said before, if we were able to get rid of the need to
end-of-recovery checkpoint [1] then we don't need separate handling in
XLogInsertAllowed() for the Checkpointer process, that would be much
cleaner and for the startup process, we would force
XLogInsertAllowed() return true by calling LocalSetXLogInsertAllowed()
for the time being as we are doing right now.

Regards,
Amul

1] "using an end-of-recovery record in all cases" :
/messages/by-id/CAAJ_b95xPx6oHRb5VEatGbp-cLsZApf_9GWGtbv9dsFKiV_VDQ@mail.gmail.com

Attachments:

POC-rearrange-code-to-remove-frequent-need-of-LocalS.patch (application/x-patch)
From bcea68c1925a018967e2a8ea42c54552d623234c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 15 Oct 2021 08:44:09 -0400
Subject: [PATCH] POC rearrange code to remove frequent need of
 LocalSetXLogInsertAllowed()

---
 src/backend/access/transam/xlog.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 62862255fca..ae561ed7e30 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8080,7 +8080,6 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
 
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
@@ -8094,17 +8093,18 @@ StartupXLOG(void)
 	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
 		promoted = PerformRecoveryXLogAction();
 
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
-	LocalSetXLogInsertAllowed();
 	XLogReportParameters();
 
+	LocalXLogInsertAllowed = -1;
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
 	 * commit timestamp.
@@ -9126,7 +9126,7 @@ CreateCheckPoint(int flags)
 	 * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
 	 * initialized, which we need here and in AdvanceXLInsertBuffer.)
 	 */
-	if (flags & CHECKPOINT_END_OF_RECOVERY)
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) && IsPostmasterEnvironment)
 		LocalSetXLogInsertAllowed();
 
 	checkPoint.ThisTimeLineID = ThisTimeLineID;
@@ -9304,7 +9304,7 @@ CreateCheckPoint(int flags)
 	 * to just temporarily disable writing until the system has exited
 	 * recovery.
 	 */
-	if (shutdown)
+	if (shutdown && IsPostmasterEnvironment)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
 			LocalXLogInsertAllowed = -1;	/* return to "check" state */
@@ -9447,8 +9447,6 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
-	LocalSetXLogInsertAllowed();
-
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9469,8 +9467,6 @@ CreateEndOfRecoveryRecord(void)
 	LWLockRelease(ControlFileLock);
 
 	END_CRIT_SECTION();
-
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
 }
 
 /*
-- 
2.18.0

#180Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#179)
2 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Oct 18, 2021 at 9:54 AM Amul Sul <sulamul@gmail.com> wrote:

I tried this in the attached version, but I'm a bit skeptical about
the changes that are needed for CreateCheckPoint(); those don't seem
clean.

Yeah, that doesn't look great. I don't think it's entirely correct,
actually, because surely you want LocalXLogInsertAllowed = 0 to be
executed even if !IsPostmasterEnvironment. It's only
LocalXLogInsertAllowed = -1 that we would want to have depend on
IsPostmasterEnvironment. But that's pretty ugly too: I guess the
reason it has to be like that is that, if it does that unconditionally, it
will overwrite the temporary value of 1 set by the caller, which will
then cause problems when the caller tries to XLogReportParameters().

I think that problem goes away if we drive the decision off of shared
state rather than a local variable, but I agree that it's otherwise a
bit tricky to untangle. One idea might be to have
LocalSetXLogInsertAllowed return the old value. Then we could use the
same kind of coding we do when switching memory contexts, where we
say:

oldcontext = MemoryContextSwitchTo(something);
// do stuff
MemoryContextSwitchTo(oldcontext);

Here we could maybe do:

oldxlallowed = LocalSetXLogInsertAllowed();
// do stuff
LocalXLogInsertAllowed = oldxlallowed;

That way, instead of CreateCheckPoint() knowing under what
circumstances the caller might have changed the value, it only knows
that some callers might have already changed the value. That seems
better.
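
A minimal sketch of that API change, assuming the existing body does
little more than flip LocalXLogInsertAllowed and initialize XLOG
access:

    /* Returns the previous value so the caller can restore it afterwards. */
    static int
    LocalSetXLogInsertAllowed(void)
    {
        int     oldXLogAllowed = LocalXLogInsertAllowed;

        LocalXLogInsertAllowed = 1;

        /* Initialize as RecoveryInProgress() would do when switching state */
        InitXLOGAccess();

        return oldXLogAllowed;
    }

CreateCheckPoint() could then simply save the return value around the
end-of-recovery work and put it back afterwards, without knowing why
the caller had changed it in the first place.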

I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
good, but I am not sure about calling it from XLogInsertAllowed()
either; both are status-check functions, and the general expectation
might be that status-checking functions do not change and/or
initialize the system state. InitXLOGAccess() should get called from
the very first WAL write operation if needed, but if we don't want to
do that, then I would prefer to call InitXLOGAccess() from
XLogInsertAllowed() instead of RecoveryInProgress().

Well, that's a fair point, too, but it might not be safe to, say, move
this to XLogBeginInsert(). Like, imagine that there's a hypothetical
piece of code that looks like this:

if (RecoveryInProgress())
    ereport(ERROR, errmsg("can't do that in recovery"));

// do something here that depends on ThisTimeLineID or
// wal_segment_size or RedoRecPtr

XLogBeginInsert();
....
lsn = XLogInsert(...);

Such code would work correctly the way things are today, but if the
InitXLOGAccess() call were deferred until XLogBeginInsert() time, then
it would fail.

I was curious whether this is just a theoretical problem. It turns out
that it's not. I wrote a couple of just-for-testing patches, which I
attach here. The first one just adjusts things so that we'll fail an
assertion if we try to make use of ThisTimeLineID before we've set it
to a legal value. I had to exempt two places from these checks just
for 'make check-world' to pass; these are shown in the patch, and one
or both of them might be existing bugs -- or maybe not, I haven't
looked too deeply. The second one then adjusts the patch to pretend
that ThisTimeLineID is not necessarily valid just because we've called
InitXLOGAccess() but that it is valid after XLogBeginInsert(). With
that change, I find about a dozen places where, apparently, the early
call to InitXLOGAccess() is critical to getting ThisTimeLineID
adjusted in time. So apparently a change of this type is not entirely
trivial. And this is just a quick test, and just for one of the three
things that get initialized here.
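
For anyone skimming past the diffs below, the core of the 0001 test
patch is just a validity flag plus an asserting accessor macro: every
site that assigns ThisTimeLineID also sets the flag, and reads go
through the macro instead of the bare variable:

    extern PGDLLIMPORT bool ThisTimeLineIDValid;
    extern PGDLLIMPORT TimeLineID ThisTimeLineID;   /* current TLI */

    /* Assert that ThisTimeLineID has been initialized before using it. */
    #define ThisTimeLineIDChecked \
        (AssertMacro(ThisTimeLineIDValid), ThisTimeLineID)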

On the other hand, just moving it to XLogInsertAllowed() isn't
risk-free either and would likely require adjusting some of the same
places I found with this test. So I guess if we want to do something
like this we need more study.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0002-Pretend-it-s-not-valid-until-XLogBeginInsert.patch (application/octet-stream)
From 6b677889bbf77f1104c922eb744424730949c5f1 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Oct 2021 18:18:05 -0400
Subject: [PATCH 2/2] Pretend it's not valid until XLogBeginInsert.

---
 src/backend/access/transam/xlog.c       | 11 +++++------
 src/backend/access/transam/xlogfuncs.c  |  2 +-
 src/backend/access/transam/xloginsert.c |  1 +
 src/backend/access/transam/xlogutils.c  | 10 +++++-----
 src/backend/replication/walsender.c     | 10 +++++-----
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5f1e6d360f..864d76da54 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2241,7 +2241,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		NewPage->xlp_magic = XLOG_PAGE_MAGIC;
 
 		/* NewPage->xlp_info = 0; */	/* done by memset */
-		NewPage->xlp_tli = ThisTimeLineIDChecked;
+		NewPage->xlp_tli = ThisTimeLineID; //  XXX wouldn't be OK if we deferred until XLogBeginInsert
 		NewPage->xlp_pageaddr = NewPageBeginPtr;
 
 		/* NewPage->xlp_rem_len = 0; */	/* done by memset */
@@ -3294,7 +3294,7 @@ XLogFileInitInternal(XLogSegNo logsegno, bool *added, char *path)
 	int			fd;
 	int			save_errno;
 
-	XLogFilePath(path, ThisTimeLineIDChecked, logsegno, wal_segment_size);
+	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size); // XXX wouldn't be OK if we deferred until XLogBeginInsert
 
 	/*
 	 * Try to use existent file (checkpoint maker may have created it already)
@@ -3653,7 +3653,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
 
-	XLogFilePath(path, ThisTimeLineIDChecked, *segno, wal_segment_size);
+	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);  // XXX wouldn't be OK if we deferred until XLogBeginInsert
 
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (!XLogCtl->InstallXLogFileSegmentActive)
@@ -8602,7 +8602,6 @@ InitXLOGAccess(void)
 
 	/* ThisTimeLineID doesn't change so we need no lock to copy it */
 	ThisTimeLineID = XLogCtl->ThisTimeLineID;
-	ThisTimeLineIDValid = true;
 	Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
 
 	/* set wal_segment_size */
@@ -9135,11 +9134,11 @@ CreateCheckPoint(int flags)
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		LocalSetXLogInsertAllowed();
 
-	checkPoint.ThisTimeLineID = ThisTimeLineIDChecked;
+	checkPoint.ThisTimeLineID = ThisTimeLineID;	// XXX wouldn't be OK if we deferred until XLogBeginInsert
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	else
-		checkPoint.PrevTimeLineID = ThisTimeLineIDChecked;
+		checkPoint.PrevTimeLineID = ThisTimeLineID; // XXX wouldn't be OK if we deferred until XLogBeginInsert
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index dada5675e5..4815fd723c 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -511,7 +511,7 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 						 "pg_walfile_name()")));
 
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
-	XLogFileName(xlogfilename, ThisTimeLineIDChecked, xlogsegno, wal_segment_size);
+	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size); // XXX wouldn't be OK if we deferred until XLogBeginInsert
 
 	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
 }
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b492c656d7..2c2e9ad28d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -145,6 +145,7 @@ XLogBeginInsert(void)
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
 
+	ThisTimeLineIDValid = true;
 	begininsert_called = true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 802517c881..ec16bd0ae0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -737,7 +737,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * it looked up the timeline. There's nothing we can do about it if
 	 * StartupXLOG() renames it to .partial concurrently.
 	 */
-	if (state->currTLI == ThisTimeLineIDChecked && wantPage >= lastReadPage)
+	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage) // XXX wouldn't be OK if we deferred until XLogBeginInsert
 	{
 		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
 		return;
@@ -749,7 +749,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * the current segment we can just keep reading.
 	 */
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
-		state->currTLI != ThisTimeLineIDChecked &&
+		state->currTLI != ThisTimeLineID && // XXX wouldn't be OK if we deferred until XLogBeginInsert
 		state->currTLI != 0 &&
 		((wantPage + wantLength) / state->segcxt.ws_segsize) <
 		(state->currTLIValidUntil / state->segcxt.ws_segsize))
@@ -772,7 +772,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 * We need to re-read the timeline history in case it's been changed
 		 * by a promotion or replay from a cascaded replica.
 		 */
-		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineIDChecked);
+		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);  // XXX wouldn't be OK if we deferred until XLogBeginInsert
 		XLogRecPtr	endOfSegment;
 
 		endOfSegment = ((wantPage / state->segcxt.ws_segsize) + 1) *
@@ -874,7 +874,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 			ThisTimeLineIDValid = true;
 		}
-		tli = ThisTimeLineIDChecked;
+		tli = ThisTimeLineID; // XXX wouldn't be OK if we deferred until XLogBeginInsert
 
 		/*
 		 * Check which timeline to get the record from.
@@ -902,7 +902,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 */
 		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 
-		if (state->currTLI == ThisTimeLineIDChecked)
+		if (state->currTLI == ThisTimeLineID)  // XXX wouldn't be OK if we deferred until XLogBeginInsert
 		{
 
 			if (loc <= read_upto)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 73a6e06e6b..25dc7e4206 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -440,7 +440,7 @@ IdentifySystem(void)
 	values[0] = CStringGetTextDatum(sysid);
 
 	/* column 2: timeline */
-	values[1] = Int32GetDatum(ThisTimeLineIDChecked);
+	values[1] = Int32GetDatum(ThisTimeLineID); // XXX wouldn't be OK if we deferred until XLogBeginInsert
 
 	/* column 3: wal location */
 	values[2] = CStringGetTextDatum(xloc);
@@ -628,7 +628,7 @@ StartReplication(StartReplicationCmd *cmd)
 		XLogRecPtr	switchpoint;
 
 		sendTimeLine = cmd->timeline;
-		if (sendTimeLine == ThisTimeLineIDChecked)
+		if (sendTimeLine == ThisTimeLineID) // XXX wouldn't be OK if we deferred until XLogBeginInsert
 		{
 			sendTimeLineIsHistoric = false;
 			sendTimeLineValidUpto = InvalidXLogRecPtr;
@@ -643,7 +643,7 @@ StartReplication(StartReplicationCmd *cmd)
 			 * Check that the timeline the client requested exists, and the
 			 * requested start location is on that timeline.
 			 */
-			timeLineHistory = readTimeLineHistory(ThisTimeLineIDChecked);
+			timeLineHistory = readTimeLineHistory(ThisTimeLineID); // XXX wouldn't be OK if we deferred until XLogBeginInsert
 			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory,
 										 &sendTimeLineNextTLI);
 			list_free_deep(timeLineHistory);
@@ -682,7 +682,7 @@ StartReplication(StartReplicationCmd *cmd)
 	}
 	else
 	{
-		sendTimeLine = ThisTimeLineIDChecked;
+		sendTimeLine = ThisTimeLineID;  // XXX wouldn't be OK if we deferred until XLogBeginInsert
 		sendTimeLineValidUpto = InvalidXLogRecPtr;
 		sendTimeLineIsHistoric = false;
 	}
@@ -812,7 +812,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
-	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineIDChecked);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);  // XXX wouldn't be OK if we deferred until XLogBeginInsert
 	sendTimeLine = state->currTLI;
 	sendTimeLineValidUpto = state->currTLIValidUntil;
 	sendTimeLineNextTLI = state->nextTLI;
-- 
2.24.3 (Apple Git-128)

0001-Test-code-to-see-whether-we-have-always-properly-ini.patch (application/octet-stream)
From c95e8c57a6e3a187a5b5c538e2fdc1e9d4d7cede Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Oct 2021 16:28:18 -0400
Subject: [PATCH 1/2] Test code to see whether we have always properly
 initialized ThisTimeLineID.

---
 src/backend/access/transam/twophase.c         |   4 +-
 src/backend/access/transam/xlog.c             | 108 ++++++++++--------
 src/backend/access/transam/xlogarchive.c      |   2 +-
 src/backend/access/transam/xlogfuncs.c        |   4 +-
 src/backend/access/transam/xlogutils.c        |  13 ++-
 src/backend/replication/basebackup.c          |  12 +-
 .../replication/logical/logicalfuncs.c        |   4 +
 src/backend/replication/slotfuncs.c           |   4 +
 src/backend/replication/walreceiver.c         |   5 +-
 src/backend/replication/walsender.c           |  19 +--
 src/include/access/xlog.h                     |   4 +
 11 files changed, 105 insertions(+), 74 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c..43cab819da 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1328,7 +1328,8 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
-	TimeLineID	save_currtli = ThisTimeLineID;
+	TimeLineID	save_currtli = ThisTimeLineID;  // XXX ThisTimeLineIDChecked fails assertion, ThisTimeLineID = 0!!!
+	bool		save_currtli_valid = ThisTimeLineIDValid;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1350,6 +1351,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	 * while recovery was finishing or if the timeline has jumped in-between.
 	 */
 	ThisTimeLineID = save_currtli;
+	ThisTimeLineIDValid = save_currtli_valid;
 
 	if (record == NULL)
 		ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 62862255fc..5f1e6d360f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -192,6 +192,7 @@ CheckpointStatsData CheckpointStats;
  * WAL timeline for the database system.
  */
 TimeLineID	ThisTimeLineID = 0;
+bool		ThisTimeLineIDValid = false;
 
 static XLogRecPtr LastRec;
 
@@ -2240,7 +2241,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		NewPage->xlp_magic = XLOG_PAGE_MAGIC;
 
 		/* NewPage->xlp_info = 0; */	/* done by memset */
-		NewPage->xlp_tli = ThisTimeLineID;
+		NewPage->xlp_tli = ThisTimeLineIDChecked;
 		NewPage->xlp_pageaddr = NewPageBeginPtr;
 
 		/* NewPage->xlp_rem_len = 0; */	/* done by memset */
@@ -2590,7 +2591,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 						continue;
 
 					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+					XLogFileName(xlogfname, ThisTimeLineIDChecked, openLogSegNo,
 								 wal_segment_size);
 					errno = save_errno;
 					ereport(PANIC,
@@ -3293,7 +3294,7 @@ XLogFileInitInternal(XLogSegNo logsegno, bool *added, char *path)
 	int			fd;
 	int			save_errno;
 
-	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
+	XLogFilePath(path, ThisTimeLineIDChecked, logsegno, wal_segment_size);
 
 	/*
 	 * Try to use existent file (checkpoint maker may have created it already)
@@ -3652,7 +3653,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
 
-	XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
+	XLogFilePath(path, ThisTimeLineIDChecked, *segno, wal_segment_size);
 
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (!XLogCtl->InstallXLogFileSegmentActive)
@@ -3678,7 +3679,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 				return false;
 			}
 			(*segno)++;
-			XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
+			XLogFilePath(path, ThisTimeLineIDChecked, *segno, wal_segment_size);
 		}
 	}
 
@@ -3707,7 +3708,7 @@ XLogFileOpen(XLogSegNo segno)
 	char		path[MAXPGPATH];
 	int			fd;
 
-	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
+	XLogFilePath(path, ThisTimeLineIDChecked, segno, wal_segment_size);
 
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
 	if (fd < 0)
@@ -3928,7 +3929,7 @@ XLogFileClose(void)
 		char		xlogfname[MAXFNAMELEN];
 		int			save_errno = errno;
 
-		XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo, wal_segment_size);
+		XLogFileName(xlogfname, ThisTimeLineIDChecked, openLogSegNo, wal_segment_size);
 		errno = save_errno;
 		ereport(PANIC,
 				(errcode_for_file_access(),
@@ -4510,7 +4511,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				if (ControlFile->minRecoveryPoint < EndRecPtr)
 				{
 					ControlFile->minRecoveryPoint = EndRecPtr;
-					ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+					ControlFile->minRecoveryPointTLI = ThisTimeLineIDChecked;
 				}
 				/* update local copy */
 				minRecoveryPoint = ControlFile->minRecoveryPoint;
@@ -4606,7 +4607,7 @@ rescanLatestTimeLine(void)
 		ereport(LOG,
 				(errmsg("new timeline %u is not a child of database system timeline %u",
 						newtarget,
-						ThisTimeLineID)));
+						ThisTimeLineIDChecked)));
 		return false;
 	}
 
@@ -4620,7 +4621,7 @@ rescanLatestTimeLine(void)
 		ereport(LOG,
 				(errmsg("new timeline %u forked off current database system timeline %u before current recovery point %X/%X",
 						newtarget,
-						ThisTimeLineID,
+						ThisTimeLineIDChecked,
 						LSN_FORMAT_ARGS(EndRecPtr))));
 		return false;
 	}
@@ -5316,6 +5317,7 @@ BootStrapXLOG(void)
 
 	/* First timeline ID is always 1 */
 	ThisTimeLineID = 1;
+	ThisTimeLineIDValid = true;
 
 	/* page buffer must be aligned suitably for O_DIRECT */
 	buffer = (char *) palloc(XLOG_BLCKSZ + XLOG_BLCKSZ);
@@ -5330,8 +5332,8 @@ BootStrapXLOG(void)
 	 * used, so that we can use 0/0 to mean "before any valid WAL segment".
 	 */
 	checkPoint.redo = wal_segment_size + SizeOfXLogLongPHD;
-	checkPoint.ThisTimeLineID = ThisTimeLineID;
-	checkPoint.PrevTimeLineID = ThisTimeLineID;
+	checkPoint.ThisTimeLineID = ThisTimeLineIDChecked;
+	checkPoint.PrevTimeLineID = ThisTimeLineIDChecked;
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.nextXid =
 		FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
@@ -5359,7 +5361,7 @@ BootStrapXLOG(void)
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
 	page->xlp_info = XLP_LONG_HEADER;
-	page->xlp_tli = ThisTimeLineID;
+	page->xlp_tli = ThisTimeLineIDChecked;
 	page->xlp_pageaddr = wal_segment_size;
 	longpage = (XLogLongPageHeader) page;
 	longpage->xlp_sysid = sysidentifier;
@@ -5640,7 +5642,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	XLogSegNo	startLogSegNo;
 
 	/* we always switch to a new timeline after archive recovery */
-	Assert(endTLI != ThisTimeLineID);
+	Assert(endTLI != ThisTimeLineIDChecked);
 
 	/*
 	 * We are no longer in archive recovery state.
@@ -5704,7 +5706,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			char		xlogfname[MAXFNAMELEN];
 			int			save_errno = errno;
 
-			XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+			XLogFileName(xlogfname, ThisTimeLineIDChecked, startLogSegNo,
 						 wal_segment_size);
 			errno = save_errno;
 			ereport(ERROR,
@@ -5717,7 +5719,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * Let's just make real sure there are not .ready or .done flags posted
 	 * for the new segment.
 	 */
-	XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo, wal_segment_size);
+	XLogFileName(xlogfname, ThisTimeLineIDChecked, startLogSegNo, wal_segment_size);
 	XLogArchiveCleanup(xlogfname);
 
 	/*
@@ -5756,7 +5758,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
-	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
+	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineIDChecked);
 
 	/*
 	 * If the switch happened in the middle of a segment, what to do with the
@@ -7137,6 +7139,7 @@ StartupXLOG(void)
 	 * also xlog_redo()).
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
+	ThisTimeLineIDValid = true;
 
 	/*
 	 * Copy any missing timeline history files between 'now' and the recovery
@@ -7150,7 +7153,7 @@ StartupXLOG(void)
 	 * are small, so it's better to copy them unnecessarily than not copy them
 	 * and regret later.
 	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	restoreTimeLineHistoryFiles(ThisTimeLineIDChecked, recoveryTargetTLI);
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -7434,7 +7437,7 @@ StartupXLOG(void)
 			XLogCtl->replayEndRecPtr = checkPoint.redo;
 		else
 			XLogCtl->replayEndRecPtr = EndRecPtr;
-		XLogCtl->replayEndTLI = ThisTimeLineID;
+		XLogCtl->replayEndTLI = ThisTimeLineIDChecked;
 		XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr;
 		XLogCtl->lastReplayedTLI = XLogCtl->replayEndTLI;
 		XLogCtl->recoveryLastXTime = 0;
@@ -7586,8 +7589,8 @@ StartupXLOG(void)
 				 */
 				if (record->xl_rmid == RM_XLOG_ID)
 				{
-					TimeLineID	newTLI = ThisTimeLineID;
-					TimeLineID	prevTLI = ThisTimeLineID;
+					TimeLineID	newTLI = ThisTimeLineIDChecked;
+					TimeLineID	prevTLI = ThisTimeLineIDChecked;
 					uint8		info = record->xl_info & ~XLR_INFO_MASK;
 
 					if (info == XLOG_CHECKPOINT_SHUTDOWN)
@@ -7607,13 +7610,14 @@ StartupXLOG(void)
 						prevTLI = xlrec.PrevTimeLineID;
 					}
 
-					if (newTLI != ThisTimeLineID)
+					if (newTLI != ThisTimeLineIDChecked)
 					{
 						/* Check that it's OK to switch to this TLI */
 						checkTimeLineSwitch(EndRecPtr, newTLI, prevTLI);
 
 						/* Following WAL records should be run with new TLI */
 						ThisTimeLineID = newTLI;
+						ThisTimeLineIDValid = true;
 						switchedTLI = true;
 					}
 				}
@@ -7624,7 +7628,7 @@ StartupXLOG(void)
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->replayEndRecPtr = EndRecPtr;
-				XLogCtl->replayEndTLI = ThisTimeLineID;
+				XLogCtl->replayEndTLI = ThisTimeLineIDChecked;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
@@ -7656,7 +7660,7 @@ StartupXLOG(void)
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->lastReplayedEndRecPtr = EndRecPtr;
-				XLogCtl->lastReplayedTLI = ThisTimeLineID;
+				XLogCtl->lastReplayedTLI = ThisTimeLineIDChecked;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
@@ -7684,7 +7688,7 @@ StartupXLOG(void)
 					 * (possibly bogus) future WAL segments on the old
 					 * timeline.
 					 */
-					RemoveNonParentXlogFiles(EndRecPtr, ThisTimeLineID);
+					RemoveNonParentXlogFiles(EndRecPtr, ThisTimeLineIDChecked);
 
 					/*
 					 * Wake up any walsenders to notice that we are on a new
@@ -7903,7 +7907,7 @@ StartupXLOG(void)
 	 *
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
-	PrevTimeLineID = ThisTimeLineID;
+	PrevTimeLineID = ThisTimeLineIDChecked;
 	if (ArchiveRecoveryRequested)
 	{
 		char	   *reason;
@@ -7912,8 +7916,9 @@ StartupXLOG(void)
 		Assert(InArchiveRecovery);
 
 		ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
+		ThisTimeLineIDValid = true;
 		ereport(LOG,
-				(errmsg("selected new timeline ID: %u", ThisTimeLineID)));
+				(errmsg("selected new timeline ID: %u", ThisTimeLineIDChecked)));
 
 		reason = getRecoveryStopReason();
 
@@ -7935,7 +7940,7 @@ StartupXLOG(void)
 		 * To minimize the window for that, try to do as little as possible
 		 * between here and writing the end-of-recovery record.
 		 */
-		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
+		writeTimeLineHistory(ThisTimeLineIDChecked, recoveryTargetTLI,
 							 EndRecPtr, reason);
 
 		/*
@@ -7951,7 +7956,7 @@ StartupXLOG(void)
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */
-	XLogCtl->ThisTimeLineID = ThisTimeLineID;
+	XLogCtl->ThisTimeLineID = ThisTimeLineIDChecked;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
@@ -8597,6 +8602,7 @@ InitXLOGAccess(void)
 
 	/* ThisTimeLineID doesn't change so we need no lock to copy it */
 	ThisTimeLineID = XLogCtl->ThisTimeLineID;
+	ThisTimeLineIDValid = true;
 	Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
 
 	/* set wal_segment_size */
@@ -9129,11 +9135,11 @@ CreateCheckPoint(int flags)
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		LocalSetXLogInsertAllowed();
 
-	checkPoint.ThisTimeLineID = ThisTimeLineID;
+	checkPoint.ThisTimeLineID = ThisTimeLineIDChecked;
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	else
-		checkPoint.PrevTimeLineID = ThisTimeLineID;
+		checkPoint.PrevTimeLineID = ThisTimeLineIDChecked;
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
@@ -9443,7 +9449,7 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.end_time = GetCurrentTimestamp();
 
 	WALInsertLockAcquireExclusive();
-	xlrec.ThisTimeLineID = ThisTimeLineID;
+	xlrec.ThisTimeLineID = ThisTimeLineIDChecked;
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
@@ -9464,7 +9470,7 @@ CreateEndOfRecoveryRecord(void)
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	ControlFile->time = (pg_time_t) time(NULL);
 	ControlFile->minRecoveryPoint = recptr;
-	ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+	ControlFile->minRecoveryPointTLI = ThisTimeLineIDChecked;
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
@@ -9800,7 +9806,10 @@ CreateRestartPoint(int flags)
 	 * with that.
 	 */
 	if (RecoveryInProgress())
+	{
 		ThisTimeLineID = replayTLI;
+		ThisTimeLineIDValid = true;
+	}
 
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
@@ -9817,7 +9826,10 @@ CreateRestartPoint(int flags)
 	 * to restore the normal state of affairs for debugging purposes.
 	 */
 	if (RecoveryInProgress())
+	{
 		ThisTimeLineID = 0;
+		ThisTimeLineIDValid = false;
+	}
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -10228,19 +10240,19 @@ static void
 checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, TimeLineID prevTLI)
 {
 	/* Check that the record agrees on what the current (old) timeline is */
-	if (prevTLI != ThisTimeLineID)
+	if (prevTLI != ThisTimeLineIDChecked)
 		ereport(PANIC,
 				(errmsg("unexpected previous timeline ID %u (current timeline ID %u) in checkpoint record",
-						prevTLI, ThisTimeLineID)));
+						prevTLI, ThisTimeLineIDChecked)));
 
 	/*
 	 * The new timeline better be in the list of timelines we expect to see,
 	 * according to the timeline history. It should also not decrease.
 	 */
-	if (newTLI < ThisTimeLineID || !tliInHistory(newTLI, expectedTLEs))
+	if (newTLI < ThisTimeLineIDChecked || !tliInHistory(newTLI, expectedTLEs))
 		ereport(PANIC,
 				(errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
-						newTLI, ThisTimeLineID)));
+						newTLI, ThisTimeLineIDChecked)));
 
 	/*
 	 * If we have not yet reached min recovery point, and we're about to
@@ -10387,10 +10399,10 @@ xlog_redo(XLogReaderState *record)
 		 * We should've already switched to the new TLI before replaying this
 		 * record.
 		 */
-		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
+		if (checkPoint.ThisTimeLineID != ThisTimeLineIDChecked)
 			ereport(PANIC,
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
-							checkPoint.ThisTimeLineID, ThisTimeLineID)));
+							checkPoint.ThisTimeLineID, ThisTimeLineIDChecked)));
 
 		RecoveryRestartPoint(&checkPoint);
 	}
@@ -10443,10 +10455,10 @@ xlog_redo(XLogReaderState *record)
 		SpinLockRelease(&XLogCtl->info_lck);
 
 		/* TLI should not change in an on-line checkpoint */
-		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
+		if (checkPoint.ThisTimeLineID != ThisTimeLineIDChecked)
 			ereport(PANIC,
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
-							checkPoint.ThisTimeLineID, ThisTimeLineID)));
+							checkPoint.ThisTimeLineID, ThisTimeLineIDChecked)));
 
 		RecoveryRestartPoint(&checkPoint);
 	}
@@ -10473,10 +10485,10 @@ xlog_redo(XLogReaderState *record)
 		 * We should've already switched to the new TLI before replaying this
 		 * record.
 		 */
-		if (xlrec.ThisTimeLineID != ThisTimeLineID)
+		if (xlrec.ThisTimeLineID != ThisTimeLineIDChecked)
 			ereport(PANIC,
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
-							xlrec.ThisTimeLineID, ThisTimeLineID)));
+							xlrec.ThisTimeLineID, ThisTimeLineIDChecked)));
 	}
 	else if (info == XLOG_NOOP)
 	{
@@ -10546,7 +10558,7 @@ xlog_redo(XLogReaderState *record)
 			if (ControlFile->minRecoveryPoint < lsn)
 			{
 				ControlFile->minRecoveryPoint = lsn;
-				ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+				ControlFile->minRecoveryPointTLI = ThisTimeLineIDChecked;
 			}
 			ControlFile->backupStartPoint = InvalidXLogRecPtr;
 			ControlFile->backupEndRequired = false;
@@ -10587,7 +10599,7 @@ xlog_redo(XLogReaderState *record)
 		if (minRecoveryPoint != InvalidXLogRecPtr && minRecoveryPoint < lsn)
 		{
 			ControlFile->minRecoveryPoint = lsn;
-			ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+			ControlFile->minRecoveryPointTLI = ThisTimeLineIDChecked;
 		}
 
 		CommitTsParameterChange(xlrec.track_commit_timestamp,
@@ -10800,7 +10812,7 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 				int			save_errno;
 
 				save_errno = errno;
-				XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+				XLogFileName(xlogfname, ThisTimeLineIDChecked, openLogSegNo,
 							 wal_segment_size);
 				errno = save_errno;
 				ereport(PANIC,
@@ -10876,7 +10888,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 		char		xlogfname[MAXFNAMELEN];
 		int			save_errno = errno;
 
-		XLogFileName(xlogfname, ThisTimeLineID, segno,
+		XLogFileName(xlogfname, ThisTimeLineIDChecked, segno,
 					 wal_segment_size);
 		errno = save_errno;
 		ereport(PANIC,
@@ -11721,7 +11733,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&startpoint), sizeof(startpoint));
 		stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);
-		stoptli = ThisTimeLineID;
+		stoptli = ThisTimeLineIDChecked;
 
 		/*
 		 * Force a switch to a new xlog segment file, so that the backup is
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e754..24fe509fdd 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -502,7 +502,7 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 {
 	char		xlog[MAXFNAMELEN];
 
-	XLogFileName(xlog, ThisTimeLineID, segno, wal_segment_size);
+	XLogFileName(xlog, ThisTimeLineIDChecked, segno, wal_segment_size);
 	XLogArchiveNotify(xlog);
 }
 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b98deb72ec..dada5675e5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -469,7 +469,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 	 * xlogfilename
 	 */
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
+	XLogFileName(xlogfilename, ThisTimeLineIDChecked, xlogsegno, wal_segment_size);
 
 	values[0] = CStringGetTextDatum(xlogfilename);
 	isnull[0] = false;
@@ -511,7 +511,7 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 						 "pg_walfile_name()")));
 
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
+	XLogFileName(xlogfilename, ThisTimeLineIDChecked, xlogsegno, wal_segment_size);
 
 	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
 }
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 88a1bfd939..802517c881 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -737,7 +737,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * it looked up the timeline. There's nothing we can do about it if
 	 * StartupXLOG() renames it to .partial concurrently.
 	 */
-	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	if (state->currTLI == ThisTimeLineIDChecked && wantPage >= lastReadPage)
 	{
 		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
 		return;
@@ -749,7 +749,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * the current segment we can just keep reading.
 	 */
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
-		state->currTLI != ThisTimeLineID &&
+		state->currTLI != ThisTimeLineIDChecked &&
 		state->currTLI != 0 &&
 		((wantPage + wantLength) / state->segcxt.ws_segsize) <
 		(state->currTLIValidUntil / state->segcxt.ws_segsize))
@@ -772,7 +772,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 * We need to re-read the timeline history in case it's been changed
 		 * by a promotion or replay from a cascaded replica.
 		 */
-		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineIDChecked);
 		XLogRecPtr	endOfSegment;
 
 		endOfSegment = ((wantPage / state->segcxt.ws_segsize) + 1) *
@@ -870,8 +870,11 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
+		{
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		tli = ThisTimeLineID;
+			ThisTimeLineIDValid = true;
+		}
+		tli = ThisTimeLineIDChecked;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -899,7 +902,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 */
 		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 
-		if (state->currTLI == ThisTimeLineID)
+		if (state->currTLI == ThisTimeLineIDChecked)
 		{
 
 			if (loc <= read_upto)
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b31c36d918..7d6370d399 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -502,9 +502,9 @@ perform_base_backup(basebackup_options *opt)
 		 * including them.
 		 */
 		XLByteToSeg(startptr, startsegno, wal_segment_size);
-		XLogFileName(firstoff, ThisTimeLineID, startsegno, wal_segment_size);
+		XLogFileName(firstoff, ThisTimeLineIDChecked, startsegno, wal_segment_size);
 		XLByteToPrevSeg(endptr, endsegno, wal_segment_size);
-		XLogFileName(lastoff, ThisTimeLineID, endsegno, wal_segment_size);
+		XLogFileName(lastoff, ThisTimeLineIDChecked, endsegno, wal_segment_size);
 
 		dir = AllocateDir("pg_wal");
 		while ((de = ReadDir(dir, "pg_wal")) != NULL)
@@ -528,7 +528,7 @@ perform_base_backup(basebackup_options *opt)
 		 * Before we go any further, check that none of the WAL segments we
 		 * need were removed.
 		 */
-		CheckXLogRemoved(startsegno, ThisTimeLineID);
+		CheckXLogRemoved(startsegno, ThisTimeLineIDChecked);
 
 		/*
 		 * Sort the WAL filenames.  We want to send the files in order from
@@ -555,7 +555,7 @@ perform_base_backup(basebackup_options *opt)
 		{
 			char		startfname[MAXFNAMELEN];
 
-			XLogFileName(startfname, ThisTimeLineID, startsegno,
+			XLogFileName(startfname, ThisTimeLineIDChecked, startsegno,
 						 wal_segment_size);
 			ereport(ERROR,
 					(errmsg("could not find WAL file \"%s\"", startfname)));
@@ -571,7 +571,7 @@ perform_base_backup(basebackup_options *opt)
 			{
 				char		nextfname[MAXFNAMELEN];
 
-				XLogFileName(nextfname, ThisTimeLineID, nextsegno,
+				XLogFileName(nextfname, ThisTimeLineIDChecked, nextsegno,
 							 wal_segment_size);
 				ereport(ERROR,
 						(errmsg("could not find WAL file \"%s\"", nextfname)));
@@ -581,7 +581,7 @@ perform_base_backup(basebackup_options *opt)
 		{
 			char		endfname[MAXFNAMELEN];
 
-			XLogFileName(endfname, ThisTimeLineID, endsegno, wal_segment_size);
+			XLogFileName(endfname, ThisTimeLineIDChecked, endsegno, wal_segment_size);
 			ereport(ERROR,
 					(errmsg("could not find WAL file \"%s\"", endfname)));
 		}
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index e59939aad1..379dc0c1d9 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -214,7 +214,11 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	if (!RecoveryInProgress())
 		end_of_wal = GetFlushRecPtr();
 	else
+	{
+		Assert(ThisTimeLineIDValid);
 		end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID);
+		ThisTimeLineIDValid = true;
+	}
 
 	ReplicationSlotAcquire(NameStr(*name), true);
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 17df99c2ac..a3f8c35aec 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -627,7 +627,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	if (!RecoveryInProgress())
 		moveto = Min(moveto, GetFlushRecPtr());
 	else
+	{
+		Assert(ThisTimeLineIDValid);
 		moveto = Min(moveto, GetXLogReplayRecPtr(&ThisTimeLineID));
+		ThisTimeLineIDValid = true;
+	}
 
 	/* Acquire the slot so we "own" it */
 	ReplicationSlotAcquire(NameStr(*slotname), true);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b90e5ca98e..0ac4e29fd6 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -395,6 +395,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
+		ThisTimeLineIDValid = true;
 		if (walrcv_startstreaming(wrconn, &options))
 		{
 			if (first_stream)
@@ -893,7 +894,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
-			recvFileTLI = ThisTimeLineID;
+			recvFileTLI = ThisTimeLineIDChecked;
 		}
 
 		/* Calculate the start offset of the received logs */
@@ -972,7 +973,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->receivedTLI = ThisTimeLineIDChecked;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b811a5c0ef..73a6e06e6b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -440,7 +440,7 @@ IdentifySystem(void)
 	values[0] = CStringGetTextDatum(sysid);
 
 	/* column 2: timeline */
-	values[1] = Int32GetDatum(ThisTimeLineID);
+	values[1] = Int32GetDatum(ThisTimeLineIDChecked);
 
 	/* column 3: wal location */
 	values[2] = CStringGetTextDatum(xloc);
@@ -628,7 +628,7 @@ StartReplication(StartReplicationCmd *cmd)
 		XLogRecPtr	switchpoint;
 
 		sendTimeLine = cmd->timeline;
-		if (sendTimeLine == ThisTimeLineID)
+		if (sendTimeLine == ThisTimeLineIDChecked)
 		{
 			sendTimeLineIsHistoric = false;
 			sendTimeLineValidUpto = InvalidXLogRecPtr;
@@ -643,7 +643,7 @@ StartReplication(StartReplicationCmd *cmd)
 			 * Check that the timeline the client requested exists, and the
 			 * requested start location is on that timeline.
 			 */
-			timeLineHistory = readTimeLineHistory(ThisTimeLineID);
+			timeLineHistory = readTimeLineHistory(ThisTimeLineIDChecked);
 			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory,
 										 &sendTimeLineNextTLI);
 			list_free_deep(timeLineHistory);
@@ -682,7 +682,7 @@ StartReplication(StartReplicationCmd *cmd)
 	}
 	else
 	{
-		sendTimeLine = ThisTimeLineID;
+		sendTimeLine = ThisTimeLineIDChecked;
 		sendTimeLineValidUpto = InvalidXLogRecPtr;
 		sendTimeLineIsHistoric = false;
 	}
@@ -812,7 +812,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
-	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
+	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineIDChecked);
 	sendTimeLine = state->currTLI;
 	sendTimeLineValidUpto = state->currTLIValidUntil;
 	sendTimeLineNextTLI = state->nextTLI;
@@ -945,7 +945,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	/* setup state for WalSndSegmentOpen */
 	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
+	sendTimeLine = ThisTimeLineID; // XXX ThisTimeLineIDChecked fails assertion, ThisTimeLineID = 0!!!
 
 	if (cmd->kind == REPLICATION_KIND_PHYSICAL)
 	{
@@ -2618,7 +2618,7 @@ XLogSendPhysical(void)
 			 * still the one recovery is recovering from? ThisTimeLineID was
 			 * updated by the GetStandbyFlushRecPtr() call above.
 			 */
-			if (sendTimeLine != ThisTimeLineID)
+			if (sendTimeLine != ThisTimeLineIDChecked)
 				becameHistoric = true;
 		}
 
@@ -2631,7 +2631,7 @@ XLogSendPhysical(void)
 			 */
 			List	   *history;
 
-			history = readTimeLineHistory(ThisTimeLineID);
+			history = readTimeLineHistory(ThisTimeLineIDChecked);
 			sendTimeLineValidUpto = tliSwitchPoint(sendTimeLine, history, &sendTimeLineNextTLI);
 
 			Assert(sendTimeLine < sendTimeLineNextTLI);
@@ -2989,9 +2989,10 @@ GetStandbyFlushRecPtr(void)
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
+	ThisTimeLineIDValid = true;
 
 	result = replayPtr;
-	if (receiveTLI == ThisTimeLineID && receivePtr > replayPtr)
+	if (receiveTLI == ThisTimeLineIDChecked && receivePtr > replayPtr)
 		result = receivePtr;
 
 	return result;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5e2c94a05f..317e110488 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -29,8 +29,12 @@
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
 extern int	sync_method;
 
+extern PGDLLIMPORT bool ThisTimeLineIDValid;
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
 
+#define ThisTimeLineIDChecked \
+	(AssertMacro(ThisTimeLineIDValid), ThisTimeLineID)
+
 /*
  * Recovery target type.
  * Only set during a Point in Time recovery, not when in standby mode.
-- 
2.24.3 (Apple Git-128)
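
To make the intent of the debugging macro above concrete: ThisTimeLineIDChecked evaluates to ThisTimeLineID but first asserts ThisTimeLineIDValid, so any use of an uninitialized timeline ID fails immediately in an assert-enabled build. A hypothetical use site would look like this (sketch only, not part of the posted patch):

    TimeLineID  tli;

    /* Trips an assertion in a cassert build if ThisTimeLineID was never set. */
    tli = ThisTimeLineIDChecked;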

#181Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#180)
4 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Oct 19, 2021 at 3:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 18, 2021 at 9:54 AM Amul Sul <sulamul@gmail.com> wrote:

I tried this in the attached version, but I'm a bit skeptical about
the changes that are needed for CreateCheckPoint(); those don't seem
to be clean.

Yeah, that doesn't look great. I don't think it's entirely correct,
actually, because surely you want LocalXLogInsertAllowed = 0 to be
executed even if !IsPostmasterEnvironment. It's only
LocalXLogInsertAllowed = -1 that we would want to have depend on
IsPostmasterEnvironment. But that's pretty ugly too: I guess the
reason it has to be like that is that, if it does that unconditionally, it
will overwrite the temporary value of 1 set by the caller, which will
then cause problems when the caller tries to XLogReportParameters().

I think that problem goes away if we drive the decision off of shared
state rather than a local variable, but I agree that it's otherwise a
bit tricky to untangle. One idea might be to have
LocalSetXLogInsertAllowed return the old value. Then we could use the
same kind of coding we do when switching memory contexts, where we
say:

oldcontext = MemoryContextSwitchTo(something);
// do stuff
MemoryContextSwitchTo(oldcontext);

Here we could maybe do:

oldxlallowed = LocalSetXLogInsertAllowed();
// do stuff
XLogInsertAllowed = oldxlallowed;

Ok, did the same in the attached 0001 patch.
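
To spell out the save/restore idiom the 0001 patch now follows (just a sketch using the names from xlog.c, not the exact committed code):

    int     oldXLogAllowed;

    oldXLogAllowed = LocalSetXLogInsertAllowed();   /* returns the previous state */

    /* ... write the WAL records that must go out right now ... */

    LocalXLogInsertAllowed = oldXLogAllowed;        /* restore, like MemoryContextSwitchTo() */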

There is no harm in calling LocalSetXLogInsertAllowed() multiple
times, but the problem I can see is that with this patch the user is
allowed to call LocalSetXLogInsertAllowed() at a time when it is not
supposed to be called, e.g. when LocalXLogInsertAllowed = 0 and
WAL writes are explicitly disabled.

That way, instead of CreateCheckPoint() knowing under what
circumstances the caller might have changed the value, it only knows
that some callers might have already changed the value. That seems
better.

I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
good, but I am not sure about calling it from XLogInsertAllowed()
either; both are status-check functions, and the general expectation
might be that status-checking functions do not change and/or
initialize system state. InitXLOGAccess() should get called from the
very first WAL write operation if needed, but if we don't want to do
that, then I would prefer to call InitXLOGAccess() from
XLogInsertAllowed() instead of RecoveryInProgress().

Well, that's a fair point, too, but it might not be safe to, say, move
this to XLogBeginInsert(). Like, imagine that there's a hypothetical
piece of code that looks like this:

if (RecoveryInProgress())
ereport(ERROR, errmsg("can't do that in recovery"));

// do something here that depends on ThisTimeLineID or
// wal_segment_size or RedoRecPtr

XLogBeginInsert();
....
lsn = XLogInsert(...);

Such code would work correctly the way things are today, but if the
InitXLOGAccess() call were deferred until XLogBeginInsert() time, then
it would fail.

I was curious whether this is just a theoretical problem. It turns out
that it's not. I wrote a couple of just-for-testing patches, which I
attach here. The first one just adjusts things so that we'll fail an
assertion if we try to make use of ThisTimeLineID before we've set it
to a legal value. I had to exempt two places from these checks just
for 'make check-world' to pass; these are shown in the patch, and one
or both of them might be existing bugs -- or maybe not, I haven't
looked too deeply. The second one then adjusts the patch to pretend
that ThisTimeLineID is not necessarily valid just because we've called
InitXLOGAccess() but that it is valid after XLogBeginInsert(). With
that change, I find about a dozen places where, apparently, the early
call to InitXLOGAccess() is critical to getting ThisTimeLineID
adjusted in time. So apparently a change of this type is not entirely
trivial. And this is just a quick test, and just for one of the three
things that get initialized here.

On the other hand, just moving it to XLogInsertAllowed() isn't
risk-free either and would likely require adjusting some of the same
places I found with this test. So I guess if we want to do something
like this we need more study.

Yeah, that requires a lot of energy and time -- I haven't done
anything related to this in the attached version.

Please have a look at the attached version, where the 0001 patch
changes LocalSetXLogInsertAllowed() as described above. The 0002 patch
moves XLogReportParameters() closer to the other WAL write operations
and removes unnecessary LocalSetXLogInsertAllowed() calls. 0003 is a
code movement that adds the XLogAcceptWrites() function, same as
before, and the 0004 patch tries to remove the startup-process
dependencies. 0004 could change depending on the decision made for the
patch I posted[1] to remove the abortedRecPtr global variable. For
now, I have copied abortedRecPtr into shared memory. Thanks!

1] /messages/by-id/CAAJ_b94Y75ZwMim+gxxexVwf_yzO-dChof90ky0dB2GstspNjA@mail.gmail.com

Regards,
Amul

Attachments:

v39-0004-Remove-dependencies-on-startup-process-specifica.patchapplication/x-patch; name=v39-0004-Remove-dependencies-on-startup-process-specifica.patchDownload
From 1173a55fab286cd1a8eb9ae3ae35fe9c2ad5cd0f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v39 4/4] Remove dependencies on startup-process specifical
 variables.

To make XLogAcceptWrites() callable outside the startup process, remove
its dependency on a few global and local variables that are specific to
the startup process.

The global variables involved are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be accessed from any process via the
existing PromoteIsTriggered().  abortedRecPtr and ArchiveRecoveryRequested
are made accessible by copying them into shared memory.  missingContrecPtr
can be recovered from existing shared-memory state, where it is stored via
EndOfLog if valid.

XLogAcceptWrites() takes two arguments, EndOfLogTLI and EndOfLog, which
are local to StartupXLOG().  Instead of passing them as arguments,
XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from shared memory
can be used as replacements for EndOfLogTLI and EndOfLog respectively.
XLogCtl->lastSegSwitchLSN is not going to change before we use it: it
changes only when the current WAL segment fills up, which cannot happen
here for two reasons.  First, WAL writes are disabled for other processes
until XLogAcceptWrites() finishes; second, before lastSegSwitchLSN is
used, XLogAcceptWrites() writes only small fixed-size WAL records (an
XLOG_FPW_CHANGE record and either a recovery-end or a checkpoint record),
which are not going to fill up the 16MB WAL segment.

EndOfLogTLI in StartupXLOG() is the timeline ID of the last record read
by xlogreader, but that xlogreader was simply re-fetching the last record
already replayed in the redo loop if we were in recovery; if we were not
in recovery, we don't need to worry, since this value is needed only when
ArchiveRecoveryRequested = true, which implicitly forces redo and sets
XLogCtl->replayEndTLI.
---
 src/backend/access/transam/xlog.c | 83 ++++++++++++++++++++++++-------
 1 file changed, 64 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e5375cd1bb5..5b12addc228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -668,6 +668,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be shared; it is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -717,6 +724,13 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr to be shared with another
+	 * process to write OVERWRITE_CONTRECORD message, if WAL writes are not
+	 * permitted in the current process which reads that.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -889,8 +903,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -939,7 +952,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -5267,7 +5280,9 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -5548,6 +5563,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5739,8 +5759,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+CleanupAfterArchiveRecovery(void)
 {
+	XLogRecPtr	EndOfLog;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5757,6 +5779,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
+	(void) GetLastSegSwitchData(&EndOfLog);
 	RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
 
 	/*
@@ -5791,6 +5814,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	{
 		char		origfname[MAXFNAMELEN];
 		XLogSegNo	endLogSegNo;
+		TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 		XLByteToPrevSeg(EndOfLog, endLogSegNo, wal_segment_size);
 		XLogFileName(origfname, EndOfLogTLI, endLogSegNo, wal_segment_size);
@@ -7965,6 +7989,16 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory.  This process
+		 * might be unable to write an OVERWRITE_CONTRECORD message because of
+		 * the WAL write restriction.  Storing it in shared memory lets another
+		 * process write it as soon as WAL writing is enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8063,8 +8097,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory now; later, once WAL writes are
+	 * permitted, an XLOG_FPW_CHANGE record is written before the resource
+	 * managers write cleanup WAL records or a checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8124,7 +8165,7 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
 	int			oldXLogAllowed;
@@ -8132,20 +8173,24 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	oldXLogAllowed = LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(XLogCtl->SharedAbortedRecPtr))
 	{
+		/*
+		 * Restore missingContrecPtr, needed to set
+		 * XLP_FIRST_IS_OVERWRITE_CONTRECORD flag on the page header where
+		 * overwrite-contrecord get written. See AdvanceXLInsertBuffer().
+		 *
+		 * NB: We can safely use lastSegSwitchLSN to restore missingContrecPtr,
+		 * which is never going to change until we reach here since there wasn't
+		 * any wal write before.
+		 */
+		GetLastSegSwitchData(&missingContrecPtr);
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
+		CreateOverwriteContrecordRecord(XLogCtl->SharedAbortedRecPtr);
+		XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
@@ -8169,7 +8214,7 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8304,8 +8349,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v39-0001-Changed-LocalSetXLogInsertAllowed-to-return-prev.patchapplication/x-patch; name=v39-0001-Changed-LocalSetXLogInsertAllowed-to-return-prev.patchDownload
From fc9ff858b2a56c6a75f56d88562d8cab821687db Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 22 Oct 2021 09:08:04 -0400
Subject: [PATCH v39 1/4] Changed LocalSetXLogInsertAllowed() to return
 previous value

---
 src/backend/access/transam/xlog.c | 32 +++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 62862255fca..23a3f35e1fe 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -905,7 +905,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 								TimeLineID prevTLI);
 static void VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec,
 									  XLogReaderState *state);
-static void LocalSetXLogInsertAllowed(void);
+static int LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
 static XLogRecPtr CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
@@ -6635,6 +6635,7 @@ StartupXLOG(void)
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
 	bool		promoted = false;
+	int			oldXLogAllowed;
 	struct stat st;
 
 	/*
@@ -8062,7 +8063,7 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	LocalSetXLogInsertAllowed();
+	oldXLogAllowed = LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
 	if (!XLogRecPtrIsInvalid(abortedRecPtr))
@@ -8080,7 +8081,7 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalXLogInsertAllowed = oldXLogAllowed;
 
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
@@ -8102,7 +8103,7 @@ StartupXLOG(void)
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
-	LocalSetXLogInsertAllowed();
+	(void) LocalSetXLogInsertAllowed();
 	XLogReportParameters();
 
 	/*
@@ -8463,19 +8464,20 @@ XLogInsertAllowed(void)
 }
 
 /*
- * Make XLogInsertAllowed() return true in the current process only.
- *
- * Note: it is allowed to switch LocalXLogInsertAllowed back to -1 later,
- * and even call LocalSetXLogInsertAllowed() again after that.
+ * Sets LocalXLogInsertAllowed to make XLogInsertAllowed() return true in the
+ * current process only and returns previous LocalXLogInsertAllowed value.
  */
-static void
+static int
 LocalSetXLogInsertAllowed(void)
 {
-	Assert(LocalXLogInsertAllowed == -1);
+	int		oldXLogAllowed = LocalXLogInsertAllowed;
+
 	LocalXLogInsertAllowed = 1;
 
 	/* Initialize as RecoveryInProgress() would do when switching state */
 	InitXLOGAccess();
+
+	return oldXLogAllowed;
 }
 
 /*
@@ -9020,6 +9022,7 @@ CreateCheckPoint(int flags)
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
+	int			oldXLogAllowed;
 
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
@@ -9127,7 +9130,7 @@ CreateCheckPoint(int flags)
 	 * initialized, which we need here and in AdvanceXLInsertBuffer.)
 	 */
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
-		LocalSetXLogInsertAllowed();
+		oldXLogAllowed = LocalSetXLogInsertAllowed();
 
 	checkPoint.ThisTimeLineID = ThisTimeLineID;
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
@@ -9307,7 +9310,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalXLogInsertAllowed = oldXLogAllowed;
 		else
 			LocalXLogInsertAllowed = 0; /* never again write WAL */
 	}
@@ -9435,6 +9438,7 @@ CreateEndOfRecoveryRecord(void)
 {
 	xl_end_of_recovery xlrec;
 	XLogRecPtr	recptr;
+	int			oldXLogAllowed;
 
 	/* sanity check */
 	if (!RecoveryInProgress())
@@ -9447,7 +9451,7 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
-	LocalSetXLogInsertAllowed();
+	oldXLogAllowed = LocalSetXLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -9470,7 +9474,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalXLogInsertAllowed = oldXLogAllowed;
 }
 
 /*
-- 
2.18.0

v39-0002-Minimize-LocalSetXLogInsertAllowed-calls-by-movi.patchapplication/x-patch; name=v39-0002-Minimize-LocalSetXLogInsertAllowed-calls-by-movi.patchDownload
From 254763823f76ceba15610eb745372cae2a50878a Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 22 Oct 2021 09:17:34 -0400
Subject: [PATCH v39 2/4] Minimize LocalSetXLogInsertAllowed() calls by moving
 XLogReportParameters().

Also, remove the LocalSetXLogInsertAllowed() call from
CreateEndOfRecoveryRecord(), which is no longer needed since
LocalXLogInsertAllowed has not been reset by the time it is called.
---
 src/backend/access/transam/xlog.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 23a3f35e1fe..d58b0ce0c71 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8081,7 +8081,6 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = oldXLogAllowed;
 
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
@@ -8095,16 +8094,16 @@ StartupXLOG(void)
 	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
 		promoted = PerformRecoveryXLogAction();
 
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * If any of the critical GUCs have changed, log them before we allow
 	 * backends to write WAL.
 	 */
-	(void) LocalSetXLogInsertAllowed();
 	XLogReportParameters();
+	LocalXLogInsertAllowed = oldXLogAllowed;
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -9438,7 +9437,6 @@ CreateEndOfRecoveryRecord(void)
 {
 	xl_end_of_recovery xlrec;
 	XLogRecPtr	recptr;
-	int			oldXLogAllowed;
 
 	/* sanity check */
 	if (!RecoveryInProgress())
@@ -9451,8 +9449,6 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
-	oldXLogAllowed = LocalSetXLogInsertAllowed();
-
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9473,8 +9469,6 @@ CreateEndOfRecoveryRecord(void)
 	LWLockRelease(ControlFileLock);
 
 	END_CRIT_SECTION();
-
-	LocalXLogInsertAllowed = oldXLogAllowed;
 }
 
 /*
-- 
2.18.0

v39-0003-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/x-patch; name=v39-0003-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From 5cbffa95285a8e3ce272afb6eca8e34b71f56169 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v39 3/4] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 111 +++++++++++++++++-------------
 1 file changed, 63 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d58b0ce0c71..e5375cd1bb5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -939,6 +939,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -6635,7 +6636,6 @@ StartupXLOG(void)
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
 	bool		promoted = false;
-	int			oldXLogAllowed;
 	struct stat st;
 
 	/*
@@ -8063,53 +8063,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	oldXLogAllowed = LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 *
-	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
-	 * entered recovery. Even if we ultimately replayed no WAL records, it will
-	 * have been initialized based on where replay was due to start.  We don't
-	 * need a lock to access this, since this can't change any more by the time
-	 * we reach this code.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-	LocalXLogInsertAllowed = oldXLogAllowed;
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8165,6 +8120,66 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	bool		promoted = false;
+	int			oldXLogAllowed;
+
+	oldXLogAllowed = LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
+	 * entered recovery. Even if we ultimately replayed no WAL records, it will
+	 * have been initialized based on where replay was due to start.  We don't
+	 * need a lock to access this, since this can't change any more by the time
+	 * we reach this code.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+	LocalXLogInsertAllowed = oldXLogAllowed;
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

#182Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#181)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:

Ok, did the same in the attached 0001 patch.

There is no harm in calling LocalSetXLogInsertAllowed() multiple
times, but the problem I can see is that with this patch the user is
allowed to call LocalSetXLogInsertAllowed() at a time when it is not
supposed to be called, e.g. when LocalXLogInsertAllowed = 0 and
WAL writes are explicitly disabled.

I've pushed 0001 and 0002 but I reversed the order of them and made a
few other edits.

I don't really see the issue you mention here as a problem. There's
only one place where we set LocalXLogInsertAllowed = 0, and I don't
know that we'll ever have another one.

--
Robert Haas
EDB: http://www.enterprisedb.com

#183Bossart, Nathan
bossartn@amazon.com
In reply to: Robert Haas (#182)
Re: [Patch] ALTER SYSTEM READ ONLY

On 10/25/21, 7:50 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:

I've pushed 0001 and 0002 but I reversed the order of them and made a
few other edits.

My compiler is complaining about oldXLogAllowed possibly being used
uninitialized in CreateCheckPoint(). AFAICT it can just be initially
set to zero to silence this warning because it will, in fact, be
initialized properly when it is used.
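
Roughly the pattern the compiler trips over, for anyone reading along (a sketch of the CreateCheckPoint() shape, not the exact code):

    int     oldXLogAllowed;     /* only assigned under CHECKPOINT_END_OF_RECOVERY */

    if (flags & CHECKPOINT_END_OF_RECOVERY)
        oldXLogAllowed = LocalSetXLogInsertAllowed();
    ...
    if (shutdown)
    {
        if (flags & CHECKPOINT_END_OF_RECOVERY)
            LocalXLogInsertAllowed = oldXLogAllowed;    /* compiler warns here */
        else
            LocalXLogInsertAllowed = 0;
    }

The compiler can't see that the second test implies the first assignment ran, so initializing the variable to zero at declaration silences the warning without changing behavior.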

Nathan

#184Robert Haas
robertmhaas@gmail.com
In reply to: Bossart, Nathan (#183)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan <bossartn@amazon.com> wrote:

My compiler is complaining about oldXLogAllowed possibly being used
uninitialized in CreateCheckPoint(). AFAICT it can just be initially
set to zero to silence this warning because it will, in fact, be
initialized properly when it is used.

Hmm, I guess I could have foreseen that, had I been a little bit
smarter than I am. I have committed a change to initialize it to 0 as
you propose.

--
Robert Haas
EDB: http://www.enterprisedb.com

#185Bossart, Nathan
bossartn@amazon.com
In reply to: Robert Haas (#184)
Re: [Patch] ALTER SYSTEM READ ONLY

On 10/25/21, 1:33 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan <bossartn@amazon.com> wrote:

My compiler is complaining about oldXLogAllowed possibly being used
uninitialized in CreateCheckPoint(). AFAICT it can just be initially
set to zero to silence this warning because it will, in fact, be
initialized properly when it is used.

Hmm, I guess I could have foreseen that, had I been a little bit
smarter than I am. I have committed a change to initialize it to 0 as
you propose.

Thanks!

Nathan

#186Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#182)
2 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Oct 25, 2021 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:

Ok, did the same in the attached 0001 patch.

There is no harm in calling LocalSetXLogInsertAllowed() multiple
times, but the problem I can see is that with this patch the user is
allowed to call LocalSetXLogInsertAllowed() at a time when it is not
supposed to be called, e.g. when LocalXLogInsertAllowed = 0 and
WAL writes are explicitly disabled.

I've pushed 0001 and 0002 but I reversed the order of them and made a
few other edits.

Thank you!

I have rebased the remaining patches on top of the latest master head
(commit # e63ce9e8d6a).

In addition to that, I made further changes to 0002: I have not
included the change from the previous version that tried to remove
the arguments of CleanupAfterArchiveRecovery(). If we want to use
XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to replace the
EndOfLogTLI and EndOfLog arguments respectively, then we also need to
consider the case where EndOfLog changes when an aborted record
exists. That can be decided only in XLogAcceptWrites(), before the
shared-memory value related to the aborted record is cleared.

Regards,
Amul

Attachments:

v40-0002-Remove-dependencies-on-startup-process-specifica.patchapplication/x-patch; name=v40-0002-Remove-dependencies-on-startup-process-specifica.patchDownload
From 14bfb4a91c8935efd17b50a0a0687101481b15c7 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v40 2/2] Remove dependencies on startup-process specifical
 variables.

To make XLogAcceptWrites() callable outside the startup process, remove
its dependency on a few global and local variables that are specific to
the startup process.

The global variables involved are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be accessed from any process via the
existing PromoteIsTriggered().  abortedRecPtr and ArchiveRecoveryRequested
are made accessible by copying them into shared memory.  missingContrecPtr
can be recovered from existing shared-memory state through
XLogCtl->lastSegSwitchLSN, which is not going to change before we use it:
it changes only when the current WAL segment gets full, and there won't be
any WAL writes until that point.

XLogAcceptWrites() takes two arguments, EndOfLogTLI and EndOfLog, which
are local to StartupXLOG().  Instead of passing them as arguments,
XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr from shared memory can
be used as replacements for EndOfLogTLI and EndOfLog respectively.
EndOfLog changes if an aborted record exists; in that case the
missingContrecPtr point is treated as the end of WAL, since everything
after it is going to be skipped anyway.

EndOfLogTLI in StartupXLOG() is the timeline ID of the last record read
by xlogreader, but that xlogreader was simply re-fetching the last record
already replayed in the redo loop if we were in recovery; if we were not
in recovery, we don't need to worry, since this value is needed only when
ArchiveRecoveryRequested = true, which implicitly forces redo and sets
XLogCtl->replayEndTLI.
---
 src/backend/access/transam/xlog.c | 82 +++++++++++++++++++++++++------
 1 file changed, 66 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d72e1967d4..10159b3312b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -669,6 +669,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be shared; it is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -718,6 +725,13 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr to be shared with another
+	 * process to write OVERWRITE_CONTRECORD message, if WAL writes are not
+	 * permitted in the current process which reads that.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -940,7 +954,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -5268,7 +5282,9 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -5549,6 +5565,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -7979,6 +8000,16 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory.  This process
+		 * might be unable to write an OVERWRITE_CONTRECORD message because of
+		 * the WAL write restriction.  Storing it in shared memory lets another
+		 * process write it as soon as WAL writing is enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8077,8 +8108,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory now; later, once WAL writes are
+	 * permitted, an XLOG_FPW_CHANGE record is written before the resource
+	 * managers write cleanup WAL records or a checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8138,29 +8176,41 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	XLogRecPtr	EndOfLog = XLogCtl->replayEndRecPtr;
+	TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(XLogCtl->SharedAbortedRecPtr))
 	{
+		/*
+		 * Restore missingContrecPtr, needed to set
+		 * XLP_FIRST_IS_OVERWRITE_CONTRECORD flag on the page header where
+		 * overwrite-contrecord get written. See AdvanceXLInsertBuffer().
+		 *
+		 * NB: We can safely use lastSegSwitchLSN to restore missingContrecPtr,
+		 * which is never going to change until we reach here since there wasn't
+		 * any wal write before.
+		 */
+		GetLastSegSwitchData(&missingContrecPtr);
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
+
+		/*
+		 * If an aborted record exists, the actual end of WAL is the missing
+		 * contrecord location, since everything after it is skipped anyway.
+		 */
+		EndOfLog = missingContrecPtr;
+
+		CreateOverwriteContrecordRecord(XLogCtl->SharedAbortedRecPtr);
+		XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
@@ -8318,8 +8368,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v40-0001-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/x-patch; name=v40-0001-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From a132586586b3775904117d14b2365f06026fd91f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v40 1/2] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 110 +++++++++++++++++-------------
 1 file changed, 63 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f547efd2944..8d72e1967d4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -940,6 +940,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -8076,53 +8077,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 *
-	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
-	 * entered recovery. Even if we ultimately replayed no WAL records, it will
-	 * have been initialized based on where replay was due to start.  We don't
-	 * need a lock to access this, since this can't change any more by the time
-	 * we reach this code.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8178,6 +8134,66 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
+{
+	bool		promoted = false;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
+	 * entered recovery. Even if we ultimately replayed no WAL records, it will
+	 * have been initialized based on where replay was due to start.  We don't
+	 * need a lock to access this, since this can't change any more by the time
+	 * we reach this code.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

#187Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#186)
7 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Oct 26, 2021 at 4:29 PM Amul Sul <sulamul@gmail.com> wrote:

On Mon, Oct 25, 2021 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:

Ok, did the same in the attached 0001 patch.

There is no harm in calling LocalSetXLogInsertAllowed() multiple
times, but the problem I can see is that with this patch the user is
allowed to call LocalSetXLogInsertAllowed() at a time when it is not
supposed to be called, e.g. when LocalXLogInsertAllowed = 0 and
WAL writes are explicitly disabled.

I've pushed 0001 and 0002 but I reversed the order of them and made a
few other edits.

Thank you!

I have rebased the remaining patches on top of the latest master head
(commit # e63ce9e8d6a).

In addition to that, I made further changes to 0002: I have not
included the change from the previous version that tried to remove
the arguments of CleanupAfterArchiveRecovery(). If we want to use
XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to replace the
EndOfLogTLI and EndOfLog arguments respectively, then we also need to
consider the case where EndOfLog changes when an aborted record
exists. That can be decided only in XLogAcceptWrites(), before the
shared-memory value related to the aborted record is cleared.

Attached is the rebased version of the refactoring patches as well as
the pg_prohibit_wal feature patches for the latest master head (commit
# 39a3105678a).

I was planning to attach a rebased version of the isolation test
patches that Mark posted before[1], but some permutation tests are not
stable (the expected errors get printed differently), so I have
dropped them from the attachment for now.

Regards,
Amul

1] /messages/by-id/9BA3BA57-6B7B-45CB-B8D9-6B5EB0104FFA@enterprisedb.com

Attachments:

v41-0003-Allow-RequestCheckpoint-call-from-checkpointer-p.patchapplication/octet-stream; name=v41-0003-Allow-RequestCheckpoint-call-from-checkpointer-p.patchDownload
From fc49ab136354077d2031ee081f95b4ac14e14201 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 21 Sep 2021 06:05:36 -0400
Subject: [PATCH v41 3/7] Allow RequestCheckpoint() call from checkpointer
 process

---
 src/backend/postmaster/checkpointer.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..3ba9d795818 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -924,9 +924,11 @@ RequestCheckpoint(int flags)
 				old_started;
 
 	/*
-	 * If in a standalone backend, just do it ourselves.
+	 * If in a standalone backend or checkpointer with wait for completion flag,
+	 * just do it ourselves.
 	 */
-	if (!IsPostmasterEnvironment)
+	if (!IsPostmasterEnvironment ||
+		(AmCheckpointerProcess() && (flags & CHECKPOINT_WAIT)))
 	{
 		/*
 		 * There's no point in doing slow checkpoints in a standalone backend,
-- 
2.18.0

v41-0007-Test-Few-tap-tests-for-wal-prohibited-system.patchapplication/octet-stream; name=v41-0007-Test-Few-tap-tests-for-wal-prohibited-system.patchDownload
From 71e45afccdfae5f049507490c8bb352ff3f86502 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v41 7/7] Test: Few tap tests for wal prohibited system

Covers the following tests:

1. Basic verification, such as inserts into normal and unlogged tables
   on a WAL-prohibited system.
2. Check that a non-superuser needs permission to alter the WAL
   prohibited state.
3. Verify that sessions with open write transactions are disconnected
   when the system state is changed to WAL prohibited.
4. Verify that the WAL write and checkpoint LSNs do not change across a
   restart of a WAL-prohibited system, and that the WAL-prohibited
   state is preserved.
5. On restart of a WAL-prohibited system the shutdown and recovery-end
   checkpoints are skipped; verify that an implicit checkpoint is
   performed when the system state changes to WAL permitted.
6. A standby server cannot be WAL prohibited; standby.signal and/or
   recovery.signal take the system out of the WAL-prohibited state.
7. Terminate a session whose transaction has performed writes but not
   yet committed when the state changes to WAL prohibited.
---
 src/test/recovery/t/027_pg_prohibit_wal.pl | 214 +++++++++++++++++++++
 1 file changed, 214 insertions(+)
 create mode 100644 src/test/recovery/t/027_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/027_pg_prohibit_wal.pl b/src/test/recovery/t/027_pg_prohibit_wal.pl
new file mode 100644
index 00000000000..059982ba7d0
--- /dev/null
+++ b/src/test/recovery/t/027_pg_prohibit_wal.pl
@@ -0,0 +1,214 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test wal prohibited state.
+use strict;
+use warnings;
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 22;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize database node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Create few tables and insert some data
+$node_primary->safe_psql('postgres',  <<EOSQL);
+CREATE TABLE tab AS SELECT i FROM generate_series(1,5) i;
+CREATE UNLOGGED TABLE unlogtab AS SELECT i FROM generate_series(1,5) i;
+EOSQL
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is now wal prohibited');
+
+#
+# In the WAL-prohibited state, further table inserts will fail.
+#
+# Note that even though an insert into an unlogged or temporary table doesn't
+# generate WAL, the transaction doing that insert will acquire a transaction id,
+# which is not allowed on a WAL-prohibited system. Also, that transaction's
+# abort or commit record will be WAL-logged at the end, which is prohibited too.
+#
+my ($stdout, $stderr, $timed_out);
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, table insert is failed');
+$node_primary->psql('postgres', 'INSERT INTO unlogtab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, unlogged table insert is failed');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on server, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint, last checkpoint lsn : $checkpoint_lsn");
+
+# Change server to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'server is change to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+is($new_checkpoint_lsn ne $checkpoint_lsn, 1,
+	"new checkpoint performed, new checkpoint lsn : $new_checkpoint_lsn");
+
+my $new_write_lsn = $node_primary->lsn('write');
+is($new_write_lsn ne $write_lsn, 1,
+	"new wal writes on server, new latest wal write lsn : $new_write_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '6',
+	'table insert passed');
+
+# Only a superuser, or a user who has been granted permission, is able to call
+# pg_prohibit_wal() to change the WAL-prohibited state.
+$node_primary->safe_psql('postgres', 'CREATE USER non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+like($stderr, qr/permission denied for function pg_prohibit_wal/,
+	'permission denied to non-superuser for alter wal prohibited state');
+$node_primary->safe_psql('postgres', 'GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'granted permission to non-superuser, able to alter wal prohibited state');
+
+# back to normal state
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(false)');
+
+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$mysession_stdin,
+	'>',
+	\$mysession_stdout,
+	'2>',
+	\$mysession_stderr,
+	$psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+	'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+	'session with open write transaction is terminated');
+
+# Now stop the primary server in WAL prohibited state and take filesystem level
+# backup and set up new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server was stopped in the WAL-prohibited state, so the filesystem
+# level copy will also be in the WAL-prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set the new server as standby of primary.
+# enable_streaming will create standby.signal file which will take out system
+# from wal prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check if the new server has been taken out from the wal prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# Recovery server cannot be put into wal prohibited state.
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# The primary is still in the wal prohibited state, so a further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(6)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'primary server is wal prohibited, table insert failed');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'primary server is changed to wal permitted');
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(6)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '7',
+	'insert passed on primary');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby, 'write');
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '7',
+	'new insert replicated on standby as well');
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location in control file found\n"
+	unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0

v41-0006-Documentation.patchapplication/octet-stream; name=v41-0006-Documentation.patchDownload
From a0187b53f686f10d2baeb0cc82f7a783095ffcde Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v41 6/7] Documentation.

---
 doc/src/sgml/func.sgml              | 20 ++++++++++
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 doc/src/sgml/monitoring.sgml        |  4 ++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 5 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 24447c00177..b4158dbc30f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25407,6 +25407,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept the state change immediately.  When
+        <literal>true</literal> is passed, the system is changed to the WAL
+        prohibited state, in which WAL writes are not allowed, unless it is
+        already in that state.  When <literal>false</literal> is passed, the
+        system is changed to the WAL permitted state, in which WAL writes are
+        allowed, unless it is already in that state.  See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c43f2140205..98b660941b1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    the WAL prohibited state, in which inserting write-ahead log records is
+    prohibited until the same function is executed to change the state back to
+    read-write.  As in Hot Standby, connections to the server are allowed and
+    can run read-only queries while in the WAL prohibited state.  If the system
+    is in the WAL prohibited state, the GUC <literal>wal_prohibited</literal>
+    reports <literal>on</literal>; otherwise it reports <literal>off</literal>.
+    When a user requests the WAL prohibited state, any existing session that is
+    running a transaction which has already performed, or is expected to
+    perform, WAL writes is terminated.  This is useful for HA setups where the
+    master server needs to stop accepting WAL writes immediately and kick out
+    any transaction expecting to write WAL at the end, for example when the
+    network is down on the master or replication connections fail.
+   </para>
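+
+   <para>
+    For example, a session with sufficient privileges can move the system into
+    the WAL prohibited state and back again (a minimal sketch):
+<programlisting>
+SELECT pg_prohibit_wal(true);
+SHOW wal_prohibited;
+SELECT pg_prohibit_wal(false);
+</programlisting>
+   </para>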
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint; at the
+    next start the server performs crash recovery, but the end-of-recovery
+    WAL writes are deferred until the system is changed back to read-write.
+    If a WAL prohibited server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system is
+    implicitly taken out of the WAL prohibited state.
+   </para>
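+
+   <para>
+    The persisted flag can also be inspected with
+    <application>pg_controldata</application>, which reports a
+    <literal>WAL write prohibited</literal> line; a sketch of the relevant
+    output:
+<screen>
+$ pg_controldata $PGDATA | grep 'WAL write prohibited'
+WAL write prohibited:                 yes
+</screen>
+   </para>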
+ </sect1>
 </chapter>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3173ec25660..54e3a1017b4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1549,6 +1549,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>SystemWALProhibitStateChange</literal></entry>
+      <entry>Waiting for the WAL prohibit state to change.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced into the WAL prohibited state by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere to stop us from modifying data when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report the error; otherwise, it will cause a PANIC as mentioned previously.
+
+During recovery we never reach the point where we try to write WAL, but
+pg_prohibit_wal() can be executed at any time by the user to stop WAL writing.
+Any backend that receives the WAL prohibited state transition barrier interrupt
+needs to stop WAL writing immediately.  To absorb the barrier, a backend kills
+its running transaction if that transaction has a valid XID, since a valid XID
+indicates that the transaction has performed, or plans to perform, WAL writes.
+A transaction that has not yet acquired a valid XID, or an operation such as
+VACUUM or CREATE INDEX CONCURRENTLY that does not necessarily have a valid XID
+but still writes WAL, is not stopped during barrier processing; such operations
+may instead hit an error from XLogBeginInsert() when they try to write WAL in
+the WAL prohibited state.  To prevent such an error from occurring inside a
+critical section, WAL write permission has to be checked before
+START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assert-only flag which records that
+permission has been checked before XLogBeginInsert() is called.  If it has not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, call
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION(); the flag is automatically reset on exit from the
+critical section.  The rules for choosing among these permission check
+routines are as follows (a short illustrative sketch follows the list):
+
+	Places where a WAL write inside the critical section can be expected
+	without a valid XID (e.g. vacuum) need to be protected by
+	CheckWALPermitted(), so that the error can be reported before the critical
+	section is entered.
+
+	Places where an INSERT or UPDATE is expected, which never happens without
+	a valid XID, can be checked using AssertWALPermittedHaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and
+	which may or may not have an XID, but where assert-enabled builds should
+	still verify that the permission check was done, should use
+	AssertWALPermitted().
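+
+As an illustrative sketch (not lifted from any one call site), the resulting
+pattern looks like this:
+
+	/* check WAL permission outside the critical section */
+	CheckWALPermitted();
+
+	START_CRIT_SECTION();
+	/* ... apply changes to shared buffers ... */
+	XLogBeginInsert();
+	/* ... XLogRegisterBuffer(), XLogRegisterData(), XLogInsert() ... */
+	END_CRIT_SECTION();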
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read-only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set in those states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v41-0004-Implement-wal-prohibit-state-using-global-barrie.patchapplication/octet-stream; name=v41-0004-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From 5968f2e4644ef4b32c62a3c064ba15b8a49e883d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v41 4/7] Implement wal prohibit state using global barrier.

Implementation:

+ 1. A user tries to change the server state to WAL-Prohibited by calling the
+    pg_prohibit_wal(true) SQL function; the current state is marked as
+    in-progress in shared memory and the checkpointer process is signaled.
+    On noticing the in-progress state transition, the checkpointer emits the
+    barrier request and then acknowledges back to the backend that requested
+    the state change once the transition has been completed.  The final state
+    is also updated in the control file to make it persistent across system
+    restarts.

+ 2. When a backend receives the WAL-Prohibited barrier, if it is already in a
+    transaction and the transaction has already been assigned an XID, then
+    the backend is killed by throwing FATAL (XXX: needs more discussion).

+ 3. Otherwise, if that backend is running a transaction without a valid XID,
+    we don't need to do anything special right now; simply call
+    ResetLocalXLogInsertAllowed() so that any future WAL insert will check
+    XLogInsertAllowed() first, which reflects the WAL prohibited state.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

+ 5. The autovacuum launcher as well as the checkpointer will not do anything
+    in the WAL-Prohibited server state until someone wakes them up, e.g. a
+    backend might later request that the system be put back into the state
+    where WAL is no longer prohibited.

+ 6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
+    and xlog rotation.  Starting up again will perform crash recovery, but
+    the end-of-recovery checkpoint and the WAL writes necessary to start the
+    server normally are skipped; they are performed later, when the system is
+    changed so that WAL is no longer prohibited.

+ 7. Altering the WAL-Prohibited mode is restricted on a standby server.

+ 8. The presence of a standby.signal and/or recovery.signal file will
+    implicitly take the server out of the WAL prohibited state permanently.

+ 9. Add a wal_prohibited GUC to show the system state -- it will be "on" when
+    the system is WAL prohibited.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 480 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 176 ++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   8 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  21 +
 src/backend/storage/ipc/ipci.c           |   7 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  31 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  59 +++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   3 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 854 insertions(+), 71 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..49404f45a16
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,480 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool HoldWALProhibitStateTransition = false;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Counter tracking WAL prohibit state changes; the last two bits of this
+	 * counter indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take an appropriate action when system wide WAL prohibit
+ *	state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to enter the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing the transaction by throwing ERROR, for the following
+		 * reasons that need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we
+		 * cannot simply abort an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * subtransaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ *
+ *	NB: The function always returns true; that leaves scope for future code
+ *	changes that might need to return false for some reason.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_BOOL(true);
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey this WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_BOOL(true);		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Other than read-write state will be considered as read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete the WAL prohibit state transition.
+ *
+ *	Depending on the final state being transitioned to, the in-memory state
+ *	update is done either before or after emitting the global barrier.
+ *
+ *	The idea is that when we say the system is WAL prohibited, WAL writes in
+ *	all backends must be prohibited, but when the system is no longer WAL
+ *	prohibited, it is not necessary to take every backend out of the WAL
+ *	prohibited state at once.  There is no harm in letting those backends run
+ *	as read-only a little longer, until we emit the barrier, since they might
+ *	have connected while the system was in the WAL prohibited state and might
+ *	be doing read-only work.  Backends that connect from now on can start
+ *	read-write operations immediately.
+ *
+ *	Therefore, when moving the system to the state where WAL is no longer
+ *	prohibited, we update the shared state immediately and emit the barrier
+ *	later.  But when moving the system to the WAL prohibited state, we emit
+ *	the global barrier first, to ensure that no backend writes WAL before we
+ *	set the system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by the checkpointer.  Otherwise, it must be a
+	 * single-user backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it needs to be
+	 * completed.  If the server crashes before the transition completes, the
+	 * control file information will be used to set the final state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* When going out of the WAL prohibited state, update the state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow WAL writes should have been done by now */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the state transition by setting the requested state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * There is no need to be too aggressive about flushing XLOG data right
+	 * away, since XLogFlush() is not restricted in the WAL prohibited state.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by all backends, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have changed by now.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/* Quick exit if the state transition is on hold */
+	if (HoldWALProhibitStateTransition)
+		return;
+
+	/*
+	 * Only the checkpointer performs the state change; it has to process all
+	 * pending WAL prohibit state change requests as soon as possible.  Since
+	 * CreateCheckPoint() and ProcessSyncRequests() sometimes run in
+	 * non-checkpointer processes, do nothing if we are not the checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	while (1)
+	{
+		WALProhibitState cur_state;
+
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes the startup process would normally perform to
+				 * start the server were skipped; if so, do them right away.
+				 * While doing that, hold off state transitions to avoid a
+				 * recursive call to process the WAL prohibit state
+				 * transition from the end-of-recovery checkpoint.
+				 */
+				if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+				{
+					HoldWALProhibitStateTransition = true;
+					PerformPendingXLogAcceptWrites();
+					HoldWALProhibitStateTransition = false;
+				}
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request that the system be put back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				return;			/* Done */
+		}
+	}
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8e35c432f5c..7a6afea9f3f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2013,23 +2013,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cbd415a5cfe..bb8848b9a18 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -239,9 +240,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in the WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -744,6 +746,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally
+	 * have been performed.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5068,6 +5076,17 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5346,6 +5365,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
 	XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6501,6 +6521,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch latest state of allow WAL writes.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6872,13 +6901,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in the WAL
+		 * prohibited state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8180,8 +8226,29 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 
-	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites();
+	/*
+	 * Before enabling WAL insertion, initialize WAL prohibit state in shared
+	 * memory that will decide the further WAL insert should be allowed or
+	 * not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		promoted = XLogAcceptWrites();
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8248,6 +8315,20 @@ XLogAcceptWrites(void)
 	TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
 	TimeLineID	ThisTimeLineID = XLogCtl->ThisTimeLineID;
 
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, then we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system in wal prohibited state, then only the checkpointer process
+	 * should be here to complete this operation which might have skipped
+	 * previously while booting the system in WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
+
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
@@ -8307,9 +8388,41 @@ XLogAcceptWrites(void)
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	ResetLocalXLogInsertAllowed();
+
+	/* Prepare to accept WAL writes. */
+	(void) XLogAcceptWrites();
+
+	/*
+	 * We need to update DBState explicitly, as the startup process does,
+	 * because the end-of-recovery checkpoint sets the DB state to a
+	 * shutdown state.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8578,9 +8691,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8599,9 +8712,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8628,6 +8752,12 @@ LocalSetXLogInsertAllowed(void)
 	return oldXLogAllowed;
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8929,9 +9059,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is performed during recovery; the shutdown checkpoint and
+	 * xlog rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8944,6 +9078,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -9194,8 +9331,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Initialize InitXLogInsert working areas before entering the critical
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 54c93b16c4c..f6d857d9533 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -701,6 +701,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_log_backend_memory_contexts(integer) FROM PUBLIC;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 96332320a73..badf5a51c6b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,12 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, WAL writes must be permitted.  Second, we
+		 * need to make sure that there is a worker slot available.  Third, we
+		 * need to make sure that no other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3ba9d795818..2315fa03657 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -348,6 +349,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -692,6 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1343,3 +1348,19 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows any process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9fa3e0631e6..9b391cb9cc2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -247,6 +248,12 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up the shared memory structure needed to handle concurrent WAL
+	 * prohibit state change requests.
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index defb75aa26a..166f9fccabe 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c25af7fe090..b595c0db1bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56f..b27625f4845 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -241,9 +242,17 @@ SyncPostCheckpoint(void)
 		entry->canceled = true;
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
-		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop.
+		 * As in ProcessSyncRequests, we don't want to stop processing WAL
+		 * prohibit state change requests for a long time when there are many
+		 * deletions to be done.  They need to be checked and processed by the
+		 * checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -302,6 +311,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -360,6 +372,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * We don't want to stop processing WAL prohibit state change requests
+		 * for a long time when there are many fsync requests to be processed.
+		 * They need to be checked and processed by the checkpointer as soon as
+		 * possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -446,6 +465,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Process WAL prohibit state change requests here too, for
+				 * the same reasons as the earlier calls above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index bf085aa93b2..4f4d07ec558 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4a5b7502f5e..130e28b1d68 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -717,6 +717,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfda..67ea808c4b9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -235,6 +236,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -677,6 +679,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2119,6 +2122,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether the WAL is prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12572,4 +12587,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..ff77a68552c
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,59 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
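
To make the counter encoding described in the header comment above concrete,
here is a small worked example (illustrative only, not part of the patch):

	uint32		counter = 0;	/* fresh read-write cluster */

	Assert(GetWALProhibitState(counter) == WALPROHIBIT_STATE_READ_WRITE);
	counter++;					/* ALTER SYSTEM READ ONLY requested */
	Assert(GetWALProhibitState(counter) == WALPROHIBIT_STATE_GOING_READ_ONLY);
	counter++;					/* transition complete */
	Assert(GetWALProhibitState(counter) == WALPROHIBIT_STATE_READ_ONLY);
	counter++;					/* ALTER SYSTEM READ WRITE requested */
	Assert(GetWALProhibitState(counter) == WALPROHIBIT_STATE_GOING_READ_WRITE);
	counter++;					/* complete again: (4 & 3) == 0, read-write */
	Assert(GetWALProhibitState(counter) == WALPROHIBIT_STATE_READ_WRITE);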
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 898df2ee034..c8685142e56 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of XLogAcceptWrites() execution */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped XLogAcceptWrites() */
+	XLOG_ACCEPT_WRITES_DONE			/* done with XLogAcceptWrites() */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -279,6 +287,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -287,8 +296,10 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern uint64 GetSystemIdentifier(void);
 extern char *GetMockAuthenticationNonce(void);
 extern bool DataChecksumsEnabled(void);
@@ -299,6 +310,7 @@ extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 749bce0cc6f..19cf88d24ba 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -184,6 +184,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* True if WAL insertion is prohibited, i.e. the system is read-only. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532ec..78da6229168 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11645,6 +11645,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4544', descr => 'change server to permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'bool',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index c22142365f1..b5baec7a868 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -221,7 +221,8 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index da6ac8ed83e..60622118874 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2834,6 +2834,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v41-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
From 63a30a4c8c4212dc109153666fd899b5ea6cd485 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v41 5/7] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before each critical section that writes WAL, to
catch the case where the system is WAL-prohibited, based on the following
criteria:

 - Use an ERROR for functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Use an Assert for functions that cannot be reached without a valid XID;
   the assertion also verifies XID validity.  For that,
   AssertWALPermittedHaveXID() is added.

To enforce the rule that one of these checks precedes any critical section
that writes WAL, a new assert-only flag, walpermit_checked_state, is added.
If the check is missing, XLogBeginInsert() will fail an assertion when called
inside a critical section.

If the WAL insert is not done inside a critical section, the preceding check
is unnecessary; we can rely on XLogBeginInsert() itself to perform the check
and report an error.
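
For illustration, a minimal sketch (not part of the patch) of the expected
pattern in a WAL-writing code path that may run without a valid XID; the
function name below is hypothetical, the helpers come from this patch series
and existing PostgreSQL code:

	static void
	example_modify_page(Relation rel, Buffer buf)
	{
		bool		needwal = RelationNeedsWAL(rel);

		/* ERROR out here if WAL is prohibited; reachable without a valid XID */
		if (needwal)
			CheckWALPermitted();

		START_CRIT_SECTION();

		/* ... scribble on the page ... */
		MarkBufferDirty(buf);

		if (needwal)
		{
			XLogBeginInsert();
			/* ... XLogRegisterBuffer(), XLogInsert(), PageSetLSN() ... */
		}

		END_CRIT_SECTION();
	}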
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 +++++--
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++---
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 +++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 ++++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++---
 src/backend/access/heap/visibilitymap.c   | 19 ++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 24 ++++++++++---
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 ++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 30 +++++++++++-----
 src/backend/access/transam/xloginsert.c   | 21 +++++++++--
 src/backend/commands/sequence.c           | 16 +++++++++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 44 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 ++++++++++++++
 38 files changed, 512 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..a3718246588 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index fbccf3d038d..e252b2c22a8 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
 			computeLeafRecompressWALData(leaf);
+			CheckWALPermitted();
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..76630b12490 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..5c7b5fc9e9d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6d2d71be32b..7b321c69880 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..e57e83c8c4d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index eb3810494f2..a47a3dd84cc 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index fe9f0df20b1..4ea7b1c934f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b312af57e11..197d226f2ec 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+		CheckWALPermitted();
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
 						XLogEnsureRecordSpace(0, 3 + nitups);
+						CheckWALPermitted();
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 159646c7c3e..d1989e93b35 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ec234a5e595..fe63d241f3b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c7331d8108e..1d86226add5 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write if WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -228,6 +229,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -282,6 +284,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -315,7 +321,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 558cc88a08a..328fb42f002 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1338,6 +1339,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1353,8 +1359,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1958,8 +1963,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1984,7 +1994,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2411,6 +2421,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2421,6 +2432,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2451,7 +2465,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..6fb0c282486 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -272,12 +274,19 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
+		/*
+		 * We can reach here from VACUUM or from the startup process, so we
+		 * need not have a valid XID.
+		 */
+		if (needwal && XLogRecPtrIsInvalid(recptr))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +483,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +497,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +531,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index c88dc6eedbd..9ed8039d730 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -235,6 +236,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0fe8c709395..fb6d0a59055 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1240,6 +1241,7 @@ _bt_insertonpg(Relation rel,
 		Page		metapg = NULL;
 		BTMetaPageData *metad = NULL;
 		BlockNumber blockcache;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		/*
 		 * If we are doing this insert because we split a page that was the
@@ -1265,6 +1267,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1303,7 +1308,7 @@ _bt_insertonpg(Relation rel,
 		}
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
@@ -1488,6 +1493,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	bool		newitemonleft,
 				isleaf,
 				isrightmost;
+	bool		needwal;
 
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
@@ -1915,13 +1921,18 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -1958,7 +1969,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_split xlrec;
 		uint8		xlinfo;
@@ -2446,6 +2457,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
@@ -2483,6 +2495,10 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
@@ -2540,7 +2556,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_newroot xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5bc7c3616a9..b4fb0a63091 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4b5f639ce..83a26847497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1164,6 +1165,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2257,6 +2260,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2355,6 +2361,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 49404f45a16..50df565386f 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 7a6afea9f3f..a4687cfffb7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1368,6 +1369,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1728,6 +1731,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bb8848b9a18..05b99744efd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1071,7 +1071,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*
 	 * Given that we're not in recovery, ThisTimeLineID is set and can't
@@ -2951,9 +2951,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9588,6 +9590,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9752,6 +9757,9 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
+	/* Assured that WAL permission has been checked */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9807,6 +9815,9 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn)
 	xlrec.overwritten_lsn = aborted_lsn;
 	xlrec.overwrite_time = GetCurrentTimestamp();
 
+	/* Assured that WAL permission has been checked */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10453,7 +10464,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10467,10 +10478,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10492,8 +10503,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 689384a411f..2b4b6040050 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -139,9 +140,20 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
-	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would force a PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
+	/*
+	 * Cross-check on whether we should be here or not.
+	 *
+	 * This check is primarily for call sites outside a critical section that
+	 * have not already performed the same WAL write permission check.
+	 */
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -219,6 +231,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96a..045f3a48da3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3888,13 +3888,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index ff77a68552c..1f8ca692347 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -56,4 +57,47 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reached when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	Assert(XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off when pg_prohibit_wal() is
+ * executed, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it won't be killed while changing the system state
+ * to WAL prohibited.  Therefore, we need to explicitly error out before
+ * entering the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a30160657..b438ec31fc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset the walprohibit_checked_state flag when no longer in a critical
+ * section.  Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v41-0002-Remove-dependencies-on-startup-process-specifica.patchapplication/octet-stream; name=v41-0002-Remove-dependencies-on-startup-process-specifica.patchDownload
From 2ddb300040df38e2f8b90d36e613d99798e7a186 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v41 2/7] Remove dependencies on startup-process-specific
 variables.

To make XLogAcceptWrites() independent of the startup process, we need to
remove its dependency on a few global and local variables that are specific
to the startup process.

The global variables involved are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be accessed from any other process via
the existing PromoteIsTriggered().  abortedRecPtr and
ArchiveRecoveryRequested are made accessible by copying them into shared
memory.  missingContrecPtr can be recovered from existing shared memory
state, namely XLogCtl->lastSegSwitchLSN, which is not going to change until
we use it: it changes only when the current WAL segment gets full, and
there won't be any WAL write until that point.

XLogAcceptWrites() used to accept two arguments, EndOfLogTLI and EndOfLog,
which are local to StartupXLOG().  Instead of passing them as arguments,
XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr from shared memory can
be used in place of EndOfLogTLI and EndOfLog respectively.  EndOfLog
differs if an aborted partial record exists; in that case, the
missingContrecPtr position is treated as the end of WAL, since everything
after it is going to be skipped anyway.

EndOfLogTLI in StartupXLOG() is the timeline ID of the last record that
xlogreader reads.  But that xlogreader was simply re-fetching the last
record that we had already replayed in the redo loop, if we were in
recovery; if we were not in recovery, we don't need to worry, since this
value is needed only when ArchiveRecoveryRequested is true, which
implicitly forces redo and sets XLogCtl->replayEndTLI.
---
 src/backend/access/transam/xlog.c | 85 ++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 163503bb87e..cbd415a5cfe 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -666,6 +666,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag, which is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -715,6 +722,13 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr so that another process can
+	 * write the OVERWRITE_CONTRECORD message if WAL writes are not permitted
+	 * in the process that set it.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -948,8 +962,7 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							 TimeLineID ThisTimeLineID);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -5330,7 +5343,9 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
+	XLogCtl->SharedArchiveRecoveryRequested = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -5609,6 +5624,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -8045,6 +8065,16 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory state.  This
+		 * process might be unable to write an OVERWRITE_CONTRECORD message
+		 * because of the WAL write restriction.  Storing it in shared memory
+		 * lets another process write it as soon as WAL writing is enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8143,8 +8173,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory here; later, once WAL writes
+	 * are permitted, an XLOG_FPW_CHANGE record is written before the resource
+	 * managers write cleanup WAL records or a checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, ThisTimeLineID);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8204,30 +8241,42 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-				 TimeLineID ThisTimeLineID)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	XLogRecPtr	EndOfLog = XLogCtl->replayEndRecPtr;
+	TimeLineID	EndOfLogTLI = XLogCtl->replayEndTLI;
+	TimeLineID	ThisTimeLineID = XLogCtl->ThisTimeLineID;
 
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(XLogCtl->SharedAbortedRecPtr))
 	{
+		/*
+		 * Restore missingContrecPtr, which is needed to set the
+		 * XLP_FIRST_IS_OVERWRITE_CONTRECORD flag on the header of the page
+		 * where the overwrite-contrecord gets written.  See AdvanceXLInsertBuffer().
+		 *
+		 * NB: We can safely use lastSegSwitchLSN to restore missingContrecPtr;
+		 * it cannot have changed before we reach here, since no WAL has been
+		 * written yet.
+		 */
+		GetLastSegSwitchData(&missingContrecPtr);
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
+
+		/*
+		 * If there is an aborted record, the actual end of WAL is the missing
+		 * contrecord, since everything after it will be skipped.
+		 */
+		EndOfLog = missingContrecPtr;
+
+		CreateOverwriteContrecordRecord(XLogCtl->SharedAbortedRecPtr);
+		XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
@@ -8385,8 +8434,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v41-0001-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/octet-stream; name=v41-0001-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From 41d611588e862179eb9657406e0099acba253e7d Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v41 1/7] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 112 +++++++++++++++++-------------
 1 file changed, 65 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f8..163503bb87e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -948,6 +948,8 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+							 TimeLineID ThisTimeLineID);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -8141,53 +8143,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 *
-	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
-	 * entered recovery. Even if we ultimately replayed no WAL records, it will
-	 * have been initialized based on where replay was due to start.  We don't
-	 * need a lock to access this, since this can't change any more by the time
-	 * we reach this code.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, ThisTimeLineID);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, ThisTimeLineID);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8243,6 +8200,67 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+				 TimeLineID ThisTimeLineID)
+{
+	bool		promoted = false;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
+	 * entered recovery. Even if we ultimately replayed no WAL records, it will
+	 * have been initialized based on where replay was due to start.  We don't
+	 * need a lock to access this, since this can't change any more by the time
+	 * we reach this code.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, ThisTimeLineID);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

#188Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#187)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version of refactoring as well as the
pg_prohibit_wal feature patches for the latest master head (commit #
39a3105678a).

I spent a lot of time today studying 0002, and specifically the
question of whether EndOfLog must be the same as
XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter
redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
enter redo, then I think it has to be the same unless something very
weird happens. EndOfLog gets set like this:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false, replayTLI);
EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the
same before these three lines of code as it is afterward. However, if
you test with recovery_target=immediate, you can get it to be
different, because in that case we drop out of the redo loop after
calling recoveryStopsBefore() rather than after calling
recoveryStopsAfter(). Similarly I'm fairly sure that if you use
recovery_target_inclusive=off you can likewise get it to be different
(though I discovered the hard way that recovery_target_inclusive=off
is ignored when you use recovery_target_name). It seems like a really
bad thing that neither recovery_target=immediate nor
recovery_target_inclusive=off have any tests, and I think we ought to
add some.

Anyway, in effect, these three lines of code have the effect of
backing up the xlogreader by one record when we stop before rather
than after a record that we're replaying. What that means is that
EndOfLog is going to be the end+1 of the last record that we actually
replayed. There might be one more record that we read but did not
replay, and that record won't impact the value we end up with in
EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
record that we actually replayed. To put that another way, there's no
way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
and before we change LastRec. So in the cases where
XLogCtl->replayEndRecPtr gets initialized at all, it can only be
different from EndOfLog if something different happens when we re-read
the last-replayed WAL record than what happened when we read it the
first time. That seems unlikely, and would be unfortunate if it did
happen. I am inclined to think that it might be better not to reread
the record at all, though. As far as this patch goes, I think we need
a solution that doesn't involve fetching EndOfLog from a variable
that's only sometimes initialized and then not doing anything with it
except in the cases where it was initialized.
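
To make that concern concrete, the sort of thing I want to avoid is a
reader of shared memory that can't tell the difference between "we never
entered redo" and a real position -- roughly like this (just an
illustrative fragment, not code from the patch, and the exact locking is
beside the point):

    XLogRecPtr  endOfLog;

    SpinLockAcquire(&XLogCtl->info_lck);
    endOfLog = XLogCtl->replayEndRecPtr;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * If we never entered redo, nothing ever set replayEndRecPtr and it
     * still reads as InvalidXLogRecPtr; the caller has to handle that
     * case explicitly rather than silently using it.
     */
    if (XLogRecPtrIsInvalid(endOfLog))
        elog(PANIC, "end of WAL is not known");

Whatever we do, the never-entered-redo case needs an explicit answer.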

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
think the regression tests contain any scenarios where we run recovery
and the values end up different. However, I think that the code sets
EndOfLogTLI to the TLI of the last WAL file that we read, and I think
XLogCtl->replayEndTLI gets set to the timeline from which that WAL
record originated. So imagine that we are looking for WAL that ought
to be in 000000010000000000000003 but we don't find it; instead we
find 000000020000000000000003 because our recovery target timeline is
2, or something that has 2 in its history. We will read the WAL for
timeline 1 from this file which has timeline 2 in the file name. I
think if recovery ends in this file before the timeline switch, these
values will be different. I did not try to construct a test case for
this today due to not having enough time, so it's possible that I'm
wrong about this, but that's how it looks to me from the code.

--
Robert Haas
EDB: http://www.enterprisedb.com

#189Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#188)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version of refactoring as well as the
pg_prohibit_wal feature patches for the latest master head (commit #
39a3105678a).

I spent a lot of time today studying 0002, and specifically the
question of whether EndOfLog must be the same as
XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter
redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
enter redo, then I think it has to be the same unless something very
weird happens. EndOfLog gets set like this:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false, replayTLI);
EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the
same before these three lines of code as it is afterward. However, if
you test with recovery_target=immediate, you can get it to be
different, because in that case we drop out of the redo loop after
calling recoveryStopsBefore() rather than after calling
recoveryStopsAfter(). Similarly I'm fairly sure that if you use
recovery_target_inclusive=off you can likewise get it to be different
(though I discovered the hard way that recovery_target_inclusive=off
is ignored when you use recovery_target_name). It seems like a really
bad thing that neither recovery_target=immediate nor
recovery_target_inclusive=off have any tests, and I think we ought to
add some.

recovery/t/003_recovery_targets.pl has a test for
recovery_target=immediate but not for recovery_target_inclusive=off; we
can add one for the recovery_target_lsn, recovery_target_time, and
recovery_target_xid cases, which are the only ones it affects.

Anyway, in effect, these three lines of code have the effect of
backing up the xlogreader by one record when we stop before rather
than after a record that we're replaying. What that means is that
EndOfLog is going to be the end+1 of the last record that we actually
replayed. There might be one more record that we read but did not
replay, and that record won't impact the value we end up with in
EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
record that we actually replayed. To put that another way, there's no
way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
and before we change LastRec. So in the cases where
XLogCtl->replayEndRecPtr gets initialized at all, it can only be
different from EndOfLog if something different happens when we re-read
the last-replayed WAL record than what happened when we read it the
first time. That seems unlikely, and would be unfortunate it if it did
happen. I am inclined to think that it might be better not to reread
the record at all, though.

There are two reasons the record is reread: first, the one you have just
explained, where the redo loop drops out due to recoveryStopsBefore(); and
second, when InRecovery is false.

In the former case, at the end the redo while-loop has read one more
record, which in effect advances EndRecPtr, and when we break out of the
loop we reach the place where we reread the record -- that is, we read the
record (i.e. LastRec) preceding the one the redo loop just read, which
correctly resets EndRecPtr.  In the latter case, we certainly don't need
any adjustment to EndRecPtr.

So technically only one case needs the reread, but even that is not really
needed, since we already have that value in XLogCtl->lastReplayedEndRecPtr.
I do agree that we do not need to reread the record, but EndOfLog and
EndOfLogTLI should then be set conditionally, something like:

if (InRecovery)
{
    EndOfLog = XLogCtl->lastReplayedEndRecPtr;
    EndOfLogTLI = XLogCtl->lastReplayedTLI;
}
else
{
    EndOfLog = EndRecPtr;
    EndOfLogTLI = replayTLI;
}

As far as this patch goes, I think we need
a solution that doesn't involve fetching EndOfLog from a variable
that's only sometimes initialized and then not doing anything with it
except in the cases where it was initialized.

Another reason is that EndOfLog can change further in the following case:

/*
 * Actually, if WAL ended in an incomplete record, skip the parts that
 * made it through and start writing after the portion that persisted.
 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 * we'll do as soon as we're open for writing new WAL.)
 */
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
    Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
    EndOfLog = missingContrecPtr;
}

Now the only solution I can think of is to copy EndOfLog (and likewise
EndOfLogTLI) into shared memory.
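
Roughly, I am imagining something like the following (an untested sketch;
the Shared* field names here are just placeholders, not from any posted
patch):

    /* new fields in XLogCtlData */
    XLogRecPtr  SharedEndOfLog;     /* end+1 of the last usable WAL record */
    TimeLineID  SharedEndOfLogTLI;  /* timeline of that end-of-log position */

    /*
     * In StartupXLOG(), once EndOfLog and EndOfLogTLI have their final
     * values, including the missingContrecPtr adjustment above.
     */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->SharedEndOfLog = EndOfLog;
    XLogCtl->SharedEndOfLogTLI = EndOfLogTLI;
    SpinLockRelease(&XLogCtl->info_lck);

Then XLogAcceptWrites() could read these from shared memory instead of
taking them as arguments, no matter which process ends up calling it.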

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
think the regression tests contain any scenarios where we run recovery
and the values end up different. However, I think that the code sets
EndOfLogTLI to the TLI of the last WAL file that we read, and I think
XLogCtl->replayEndTLI gets set to the timeline from which that WAL
record originated. So imagine that we are looking for WAL that ought
to be in 000000010000000000000003 but we don't find it; instead we
find 000000020000000000000003 because our recovery target timeline is
2, or something that has 2 in its history. We will read the WAL for
timeline 1 from this file which has timeline 2 in the file name. I
think if recovery ends in this file before the timeline switch, these
values will be different. I did not try to construct a test case for
this today due to not having enough time, so it's possible that I'm
wrong about this, but that's how it looks to me from the code.

I am not sure I have understood this scenario, due to my lack of
expertise in this area -- why wouldn't we find the record we are looking
for, the one that ought to be in 000000010000000000000003?  Possibly WAL
corruption, or that file being missing?

Regards,
Amul

#190Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#189)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:

On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version of refactoring as well as the
pg_prohibit_wal feature patches for the latest master head (commit #
39a3105678a).

I spent a lot of time today studying 0002, and specifically the
question of whether EndOfLog must be the same as
XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter
redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
enter redo, then I think it has to be the same unless something very
weird happens. EndOfLog gets set like this:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false, replayTLI);
EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the
same before these three lines of code as it is afterward. However, if
you test with recovery_target=immediate, you can get it to be
different, because in that case we drop out of the redo loop after
calling recoveryStopsBefore() rather than after calling
recoveryStopsAfter(). Similarly I'm fairly sure that if you use
recovery_target_inclusive=off you can likewise get it to be different
(though I discovered the hard way that recovery_target_inclusive=off
is ignored when you use recovery_target_name). It seems like a really
bad thing that neither recovery_target=immediate nor
recovery_target_inclusive=off have any tests, and I think we ought to
add some.

recovery/t/003_recovery_targets.pl has test for
recovery_target=immediate but not for recovery_target_inclusive=off, we
can add that for recovery_target_lsn, recovery_target_time, and
recovery_target_xid case only where it affects.

Anyway, in effect, these three lines of code have the effect of
backing up the xlogreader by one record when we stop before rather
than after a record that we're replaying. What that means is that
EndOfLog is going to be the end+1 of the last record that we actually
replayed. There might be one more record that we read but did not
replay, and that record won't impact the value we end up with in
EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
record that we actually replayed. To put that another way, there's no
way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
and before we change LastRec. So in the cases where
XLogCtl->replayEndRecPtr gets initialized at all, it can only be
different from EndOfLog if something different happens when we re-read
the last-replayed WAL record than what happened when we read it the
first time. That seems unlikely, and would be unfortunate it if it did
happen. I am inclined to think that it might be better not to reread
the record at all, though.

There are two reasons that the record is reread; first, one that you
have just explained where the redo loop drops out due to
recoveryStopsBefore() and another one is that InRecovery is false.

In the formal case at the end, redo while-loop does read a new record
which in effect updates EndRecPtr and when we breaks the loop, we do
reach the place where we do reread record -- where we do read the
record (i.e. LastRec) before the record that redo loop has read and
which correctly sets EndRecPtr. In the latter case, definitely, we
don't need any adjustment to EndRecPtr.

So technically one case needs reread but that is also not needed, we
have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
do not need to reread the record, but EndOfLog and EndOfLogTLI should
be set conditionally something like:

if (InRecovery)
{
EndOfLog = XLogCtl->lastReplayedEndRecPtr;
EndOfLogTLI = XLogCtl->lastReplayedTLI;
}
else
{
EndOfLog = EndRecPtr;
EndOfLogTLI = replayTLI;
}

As far as this patch goes, I think we need
a solution that doesn't involve fetching EndOfLog from a variable
that's only sometimes initialized and then not doing anything with it
except in the cases where it was initialized.

Another reason could be EndOfLog changes further in the following case:

/*
* Actually, if WAL ended in an incomplete record, skip the parts that
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

Now only solution that I can think is to copy EndOfLog (so
EndOfLogTLI) into shared memory.

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
think the regression tests contain any scenarios where we run recovery
and the values end up different. However, I think that the code sets
EndOfLogTLI to the TLI of the last WAL file that we read, and I think
XLogCtl->replayEndTLI gets set to the timeline from which that WAL
record originated. So imagine that we are looking for WAL that ought
to be in 000000010000000000000003 but we don't find it; instead we
find 000000020000000000000003 because our recovery target timeline is
2, or something that has 2 in its history. We will read the WAL for
timeline 1 from this file which has timeline 2 in the file name. I
think if recovery ends in this file before the timeline switch, these
values will be different. I did not try to construct a test case for
this today due to not having enough time, so it's possible that I'm
wrong about this, but that's how it looks to me from the code.

I am not sure, I have understood this scenario due to lack of
expertise in this area -- Why would the record we looking that ought
to be in 000000010000000000000003 we don't find it? Possibly WAL
corruption or that file is missing?

On further study of XLogPageRead(), WaitForWALToBecomeAvailable(), and
XLogFileReadAnyTLI(), I think I can now see that there could be a case
where the record we are looking for belongs to TLI 1, but we open a file
that has TLI 2 in its name.  But I am wondering what would be wrong with
reporting TLI 1 for that record even if we read it from a file that has
TLI 2 (or 3, or 4) in its file name -- that statement is still true, and
the record should still be accessible via the file name with TLI 1.  Also,
if we are going to treat this record, which exists before the timeline
switch point, as the EndOfLog, then why should we be worried about the
later timeline switch, given that everything after EndOfLog is going to be
useless for us anyway?  We might continue switching the TLI and/or writing
WAL right after EndOfLog; correct me if I am missing something here.

Further, I still think replayEndTLI is set to the correct value -- the one
we are looking for as EndOfLogTLI -- when we go through the redo loop.
When it reads a record and finds a change in the current replayTLI, it
updates it as:

if (newReplayTLI != replayTLI)
{
    /* Check that it's OK to switch to this TLI */
    checkTimeLineSwitch(EndRecPtr, newReplayTLI,
                        prevReplayTLI, replayTLI);

    /* Following WAL records should be run with new TLI */
    replayTLI = newReplayTLI;
    switchedTLI = true;
}

Then replayEndTLI gets updated.  If we are going to skip the reread of
"LastRec" that we were discussing, then I think the following code that
fetches EndOfLogTLI is also not needed; XLogCtl->replayEndTLI (or
XLogCtl->lastReplayedTLI), or replayTLI when InRecovery is false, should
be enough, AFAICU.

/*
* EndOfLogTLI is the TLI in the filename of the XLOG segment containing
* the end-of-log. It could be different from the timeline that EndOfLog
* nominally belongs to, if there was a timeline switch in that segment,
* and we were reading the old WAL from a segment belonging to a higher
* timeline.
*/
EndOfLogTLI = xlogreader->seg.ws_tli;

Regards,
Amul

#191Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#190)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote:

On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:

On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version of refactoring as well as the
pg_prohibit_wal feature patches for the latest master head (commit #
39a3105678a).

I spent a lot of time today studying 0002, and specifically the
question of whether EndOfLog must be the same as
XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter
redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
enter redo, then I think it has to be the same unless something very
weird happens. EndOfLog gets set like this:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false, replayTLI);
EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the
same before these three lines of code as it is afterward. However, if
you test with recovery_target=immediate, you can get it to be
different, because in that case we drop out of the redo loop after
calling recoveryStopsBefore() rather than after calling
recoveryStopsAfter(). Similarly I'm fairly sure that if you use
recovery_target_inclusive=off you can likewise get it to be different
(though I discovered the hard way that recovery_target_inclusive=off
is ignored when you use recovery_target_name). It seems like a really
bad thing that neither recovery_target=immediate nor
recovery_target_inclusive=off have any tests, and I think we ought to
add some.

recovery/t/003_recovery_targets.pl has test for
recovery_target=immediate but not for recovery_target_inclusive=off, we
can add that for recovery_target_lsn, recovery_target_time, and
recovery_target_xid case only where it affects.

Anyway, in effect, these three lines of code have the effect of
backing up the xlogreader by one record when we stop before rather
than after a record that we're replaying. What that means is that
EndOfLog is going to be the end+1 of the last record that we actually
replayed. There might be one more record that we read but did not
replay, and that record won't impact the value we end up with in
EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
record that we actually replayed. To put that another way, there's no
way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
and before we change LastRec. So in the cases where
XLogCtl->replayEndRecPtr gets initialized at all, it can only be
different from EndOfLog if something different happens when we re-read
the last-replayed WAL record than what happened when we read it the
first time. That seems unlikely, and would be unfortunate it if it did
happen. I am inclined to think that it might be better not to reread
the record at all, though.

There are two reasons the record is reread: first, the one you have just
explained, where the redo loop drops out due to recoveryStopsBefore();
second, the case where InRecovery is false.

In the former case, by the end the redo while-loop has read one more
record, which advances EndRecPtr; when we break out of the loop we reach
the place where the record is reread -- that is, we read the record
before the one the redo loop just read (i.e. LastRec), which sets
EndRecPtr back to the correct value. In the latter case we clearly don't
need any adjustment to EndRecPtr at all.

So technically only one case needs the reread, and even that isn't
strictly necessary, because the value is already available in
XLogCtl->lastReplayedEndRecPtr. I agree that we do not need to reread the
record, but EndOfLog and EndOfLogTLI should then be set conditionally,
something like:

if (InRecovery)
{
    EndOfLog = XLogCtl->lastReplayedEndRecPtr;
    EndOfLogTLI = XLogCtl->lastReplayedTLI;
}
else
{
    EndOfLog = EndRecPtr;
    EndOfLogTLI = replayTLI;
}
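
For reference, the bookkeeping that makes XLogCtl->lastReplayedEndRecPtr
usable here is done in the main redo loop itself, roughly as follows
(paraphrased from StartupXLOG() in xlog.c, comments added, so the exact
wording may differ):

/* after rm_redo() has successfully applied the current record */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->lastReplayedEndRecPtr = EndRecPtr; /* end+1 of the applied record */
XLogCtl->lastReplayedTLI = replayTLI;       /* TLI the record belongs to */
SpinLockRelease(&XLogCtl->info_lck);

So when InRecovery is true, those shared fields already describe the last
record that was actually applied, which is exactly the value EndOfLog is
supposed to carry.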

As far as this patch goes, I think we need
a solution that doesn't involve fetching EndOfLog from a variable
that's only sometimes initialized and then not doing anything with it
except in the cases where it was initialized.

Another consideration is that EndOfLog can change once more afterward, in
the following case:

/*
 * Actually, if WAL ended in an incomplete record, skip the parts that
 * made it through and start writing after the portion that persisted.
 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 * we'll do as soon as we're open for writing new WAL.)
 */
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
    Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
    EndOfLog = missingContrecPtr;
}

So the only solution I can think of is to copy EndOfLog (and likewise
EndOfLogTLI) into shared memory.

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
think the regression tests contain any scenarios where we run recovery
and the values end up different. However, I think that the code sets
EndOfLogTLI to the TLI of the last WAL file that we read, and I think
XLogCtl->replayEndTLI gets set to the timeline from which that WAL
record originated. So imagine that we are looking for WAL that ought
to be in 000000010000000000000003 but we don't find it; instead we
find 000000020000000000000003 because our recovery target timeline is
2, or something that has 2 in its history. We will read the WAL for
timeline 1 from this file which has timeline 2 in the file name. I
think if recovery ends in this file before the timeline switch, these
values will be different. I did not try to construct a test case for
this today due to not having enough time, so it's possible that I'm
wrong about this, but that's how it looks to me from the code.

I am not sure I have fully understood this scenario, due to my lack of
expertise in this area -- why would the record we are looking for, which
ought to be in 000000010000000000000003, not be found there? Possibly
because of WAL corruption, or because that file is missing?

After further study of XLogPageRead(), WaitForWALToBecomeAvailable(), and
XLogFileReadAnyTLI(), I think I can see that there can be a case where
the record we are looking for belongs to TLI 1, yet we open a file with
TLI 2 in its name. But I wonder what would be wrong with saying that the
record belongs to TLI 1 even if the file we read it from has TLI 2 (or 3,
or 4) in its name -- that statement is still true, and the record should
still be reachable via a file name carrying TLI 1. Also, if we are going
to treat this record, which exists before the timeline switch point, as
the EndOfLog, then why worry about the later timeline switch at all,
since everything after EndOfLog is useless to us anyway? We might simply
continue switching timelines and/or writing WAL right after EndOfLog --
please correct me if I am missing something here.

Further, I still think replayEndTLI is set to the value we want for
EndOfLogTLI when we go through the redo loop. When the loop reads a
record and finds that its timeline differs from the current replayTLI,
it updates it as follows:

if (newReplayTLI != replayTLI)
{
    /* Check that it's OK to switch to this TLI */
    checkTimeLineSwitch(EndRecPtr, newReplayTLI,
                        prevReplayTLI, replayTLI);

    /* Following WAL records should be run with new TLI */
    replayTLI = newReplayTLI;
    switchedTLI = true;
}

Then replayEndTLI gets updated. If we are going to skip the reread of
"LastRec" that we were discussing, then I think the following code that
fetches EndOfLogTLI is also not needed; XLogCtl->replayEndTLI (or
XLogCtl->lastReplayedTLI), or replayTLI when InRecovery is false, should
be enough, AFAICU.

/*
 * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
 * the end-of-log. It could be different from the timeline that EndOfLog
 * nominally belongs to, if there was a timeline switch in that segment,
 * and we were reading the old WAL from a segment belonging to a higher
 * timeline.
 */
EndOfLogTLI = xlogreader->seg.ws_tli;

I think I found the case that justifies this: the TLI fetch above is
needed when we restore from archived WAL files. In my trial, the archive
directory contains the files below (kindly ignore the extra history
files; I performed a few more trials to be sure):

-rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
-rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
-rw-------. 1 amul amul 128 Nov 17 06:36 00000004.history
-rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
-rw-------. 1 amul amul 171 Nov 17 06:39 00000005.history
-rw-------. 1 amul amul 209 Nov 17 06:45 00000006.history
-rw-------. 1 amul amul 247 Nov 17 06:52 00000007.history

The timeline switch happened inside the 1F segment, but the archiver had
kept a copy of the older-timeline segment, renamed with a .partial
suffix. While performing PITR from these archived files, the .partial
file seems to be skipped by the restore; the file named with the next
timeline ID is selected to read the records that belong to the previous
timeline ID as well (i.e. timeline 4 here, for all records before the
switch point). Here are the files inside the pg_wal directory after the
restore; note that in this experiment I chose recovery_target_xid =
<just before the timeline#5 switch point > and recovery_target_action =
'promote':

-rw-------. 1 amul amul 85 Nov 17 07:33 00000003.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
-rw-------. 1 amul amul 128 Nov 17 07:33 00000004.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
-rw-------. 1 amul amul 171 Nov 17 07:33 00000005.history
-rw-------. 1 amul amul 209 Nov 17 07:33 00000006.history
-rw-------. 1 amul amul 247 Nov 17 07:33 00000007.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F

The last one is the new WAL file created in that cluster.
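
To make the mismatch concrete, here is a small stand-alone C sketch
(illustration only, not part of the patch) built from the trial above.
XLogFileName() and MAXFNAMELEN are the existing helpers from
access/xlog_internal.h; the segment number and timeline values are simply
the ones visible in the listing:

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlog_internal.h"

static void
show_end_of_log_tli_mismatch(void)
{
	char		fname[MAXFNAMELEN];
	XLogSegNo	segno = 0x1F;	/* the ...1F segment from the listing */
	TimeLineID	file_tli = 5;	/* TLI in the restored file's name */
	TimeLineID	record_tli = 4;	/* TLI of the records replayed from it */

	/* Builds "00000005000000000000001F", the file recovery actually read */
	XLogFileName(fname, file_tli, segno, wal_segment_size);

	/*
	 * Recovery stopped before the switch to timeline 5, so the last record
	 * replayed belongs to timeline 4 (replayEndTLI / lastReplayedTLI = 4),
	 * while EndOfLogTLI, taken from xlogreader->seg.ws_tli, is 5.
	 */
	elog(LOG, "last record on TLI %u read from segment %s (file TLI %u)",
		 record_tli, fname, file_tli);
}

That is exactly the divergence the experiment shows, and why neither
replayEndTLI nor lastReplayedTLI can substitute for EndOfLogTLI here.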

Regards,
Amul

#192Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#191)
2 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

With this experiment, I think it is clear that EndOfLogTLI can be
different from replayEndTLI or lastReplayedTLI, and we have no way to
make it available to other processes other than exporting it into shared
memory. Similarly, while there are several candidates for deriving the
EndOfLog value (e.g. replayEndRecPtr, lastReplayedEndRecPtr,
lastSegSwitchLSN, etc.), none of them is a perfect and reliable
substitute.

Therefore, in the attached patch, I have exported EndOfLog and
EndOfLogTLI into shared memory. I have attached only the refactoring
patches, since a bunch of other work still needs to be done on the main
ASRO patches, which I discussed with Robert off-list. Thanks.
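
For quick reference, the essence of what the 0002 patch below exports
(field names as in the patch; comments paraphrased and condensed here):

/* new fields in XLogCtlData */
XLogRecPtr	endOfLog;		/* valid end of WAL determined at startup */
TimeLineID	endOfLogTLI;	/* TLI in the name of the segment holding it */

/* in StartupXLOG(), once EndOfLog and EndOfLogTLI are finally known */
XLogCtl->endOfLog = EndOfLog;
XLogCtl->endOfLogTLI = EndOfLogTLI;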

Regards,
Amul

Attachments:

v42-0002-Remove-dependencies-on-startup-process-specifica.patch (application/x-patch)
From a7b49c32077b380c515845f0e8c7c5e6cd13104a Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v42 2/2] Remove dependencies on startup-process specific
 variables.

To make XLogAcceptWrites() callable from places other than the startup
process, remove its dependency on a few global and local variables that
are specific to the startup process.

The global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be read from any process through the
existing PromoteIsTriggered().  abortedRecPtr and
ArchiveRecoveryRequested are made accessible by copying them into shared
memory.  missingContrecPtr can be derived from the EndOfLog value kept in
shared memory.

XLogAcceptWrites() used to accept two arguments, EndOfLogTLI and
EndOfLog, which are local to StartupXLOG().  Both of these are also
exported into shared memory, since none of the existing shared memory
variables matches these values exactly.
---
 src/backend/access/transam/xlog.c | 93 +++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a18be49091f..7faf6675be6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -660,6 +660,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be share which is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -709,6 +716,21 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr to be shared with another
+	 * process to write OVERWRITE_CONTRECORD message, if WAL writes are not
+	 * permitted in the current process which reads that.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
+	/*
+	 * Determines an endpoint that we consider a valid portion of WAL when
+	 * server startup.  It is invalid during recovery and does not change once
+	 * set.
+	 */
+	XLogRecPtr	endOfLog;
+	TimeLineID	endOfLogTLI;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -886,9 +908,7 @@ static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog,
 								TimeLineID newTLI);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog,
-										TimeLineID newTLI);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -942,8 +962,7 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							 TimeLineID newTLI);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -5603,6 +5622,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5793,9 +5817,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog, TimeLineID newTLI)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							TimeLineID newTLI)
+CleanupAfterArchiveRecovery(void)
 {
+	XLogRecPtr	EndOfLog = XLogCtl->endOfLog;
+	TimeLineID	EndOfLogTLI = XLogCtl->endOfLogTLI;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5813,7 +5839,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
-	RemoveNonParentXlogFiles(EndOfLog, newTLI);
+	RemoveNonParentXlogFiles(EndOfLog, XLogCtl->InsertTimeLineID);
 
 	/*
 	 * If the switch happened in the middle of a segment, what to do with the
@@ -8041,6 +8067,16 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember broken record pointer in shared memory state. This process
+		 * might unable to write an OVERWRITE_CONTRECORD message because of WAL
+		 * write restriction.  Storing in shared memory helps that get written
+		 * later by another process as soon as WAL writing is enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8109,6 +8145,13 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
+	/*
+	 * Store EndOfLog and EndOfLogTLI into shared memory to share with other
+	 * processes.
+	 */
+	XLogCtl->endOfLog = EndOfLog;
+	XLogCtl->endOfLogTLI = EndOfLogTLI;
+
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
@@ -8139,8 +8182,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory, and later whenever wal write
+	 * permitted, write an XLOG_FPW_CHANGE record before resource manager writes
+	 * cleanup WAL records or checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, newTLI);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8200,30 +8250,29 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-				 TimeLineID newTLI)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(XLogCtl->SharedAbortedRecPtr))
 	{
+		/* Restore values */
+		abortedRecPtr = XLogCtl->SharedAbortedRecPtr;
+		missingContrecPtr = XLogCtl->endOfLog;
+
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
 		CreateOverwriteContrecordRecord(abortedRecPtr);
+
+		XLogCtl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 		abortedRecPtr = InvalidXLogRecPtr;
 		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
@@ -8246,7 +8295,7 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8381,8 +8430,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v42-0001-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 38764502922f09dcdcec54e3945adfaae9616760 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v42 1/2] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 112 +++++++++++++++++-------------
 1 file changed, 65 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 221e4cb34f8..a18be49091f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -942,6 +942,8 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+							 TimeLineID newTLI);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -8137,53 +8139,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 *
-	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
-	 * entered recovery. Even if we ultimately replayed no WAL records, it will
-	 * have been initialized based on where replay was due to start.  We don't
-	 * need a lock to access this, since this can't change any more by the time
-	 * we reach this code.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, newTLI);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8239,6 +8196,67 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+				 TimeLineID newTLI)
+{
+	bool		promoted = false;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
+	 * entered recovery. Even if we ultimately replayed no WAL records, it will
+	 * have been initialized based on where replay was due to start.  We don't
+	 * need a lock to access this, since this can't change any more by the time
+	 * we reach this code.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

#193Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#192)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attaching the rest of the patches. When XLogAcceptWrites() ->
PerformRecoveryXLogAction() is executed in the checkpointer process, we
should ideally perform a full checkpoint, but we can't do that with the
current PerformRecoveryXLogAction(): it would call RequestCheckpoint()
with the WAIT flag, which makes the checkpointer wait forever on itself
to finish the requested checkpoint -- bad!

One option is to change RequestCheckpoint() so that, in the checkpointer
process, it calls CreateCheckPoint() directly, as we already do in the
!IsPostmasterEnvironment case. The problem is that XLogWrite() running
inside the checkpointer can then reach CreateCheckPoint() and cause the
unexpected behaviour I noted previously[1]. Whether the
RequestCheckpoint() call from XLogWrite() is needed at all when running
inside the checkpointer deserves a separate discussion. For now, I have
changed PerformRecoveryXLogAction() to call CreateCheckPoint() directly
in the checkpointer process; in the v41-0003 version I tried to make the
corresponding change in RequestCheckpoint() instead, but that change
looked too ugly.
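
A minimal sketch of that approach (my condensation, not the literal hunk
from the attached patches; AmCheckpointerProcess(), CreateCheckPoint(),
RequestCheckpoint() and the CHECKPOINT_* flags are existing PostgreSQL
facilities):

static bool
PerformRecoveryXLogAction(void)
{
	bool		promoted = false;

	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
		PromoteIsTriggered())
	{
		/* Promotion: write an end-of-recovery record instead of a checkpoint */
		promoted = true;
		CreateEndOfRecoveryRecord();
	}
	else if (AmCheckpointerProcess())
	{
		/*
		 * We are the checkpointer, so RequestCheckpoint(... | CHECKPOINT_WAIT)
		 * would wait on ourselves forever; run the checkpoint directly.
		 */
		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
	}
	else
	{
		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
						  CHECKPOINT_IMMEDIATE |
						  CHECKPOINT_WAIT);
	}

	return promoted;
}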

Another problem is a recursive call to XLogAcceptWrites() in the
checkpointer process, caused by the aforesaid CreateCheckPoint() call
from PerformRecoveryXLogAction(). The background is that, to avoid delays
in processing WAL prohibit state change requests, we have added
ProcessWALProhibitStateChangeRequest() calls in multiple places so that
the checkpointer can check for and process a request while performing a
long-running checkpoint. When the checkpointer calls CreateCheckPoint()
from PerformRecoveryXLogAction(), it can hit
ProcessWALProhibitStateChangeRequest() again, and since the
XLogAcceptWrites() operation has not completed yet, it would try to do
the same work a second time. To avoid that I have added a flag, and
ProcessWALProhibitStateChangeRequest() is skipped while that flag is set;
see ProcessWALProhibitStateChangeRequest() in the attached 0003 patch.
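
A bare-bones illustration of that re-entrancy guard (the flag name and
the function body here are simplified stand-ins; the real logic is in the
attached 0003 patch):

/* set while the WAL prohibit state change work is already underway */
static bool ProcessingWALProhibitStateChange = false;

void
ProcessWALProhibitStateChangeRequest(void)
{
	/* Re-entered from CreateCheckPoint() during the same operation: skip */
	if (ProcessingWALProhibitStateChange)
		return;

	ProcessingWALProhibitStateChange = true;

	/* ... perform the actual WAL prohibit/permit state transition ... */

	ProcessingWALProhibitStateChange = false;
}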

Note that both issues noted above boil down to CreateCheckPoint() and
whether we really need it. If we don't need to perform a full checkpoint
in this case, we might not have the recursion issue at all: we could
instead write the end-of-recovery record via CreateEndOfRecoveryRecord(),
as PerformRecoveryXLogAction() currently does for the promotion case, and
leave the full checkpoint for later, though skipping the immediate full
checkpoint might look scary. I tried that and it works fine for me, but I
am not very confident about it.

Regards,
Amul

[1] /messages/by-id/CAAJ_b97fPWU_yyOg97Y5AtSvx5mrg2cGyz260swz5x5iPKEM+g@mail.gmail.com

Attachments:

v43-0006-Test-Few-tap-tests-for-wal-prohibited-system.patch (application/x-patch)
From 93a79d85670c0a915da6daef4c53629587626efd Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v43 6/6] Test: Few tap tests for wal prohibited system

Does the following testing:

1. Basic verification, such as inserts into normal and unlogged tables
   on a wal prohibited system.
2. Check that a non-superuser needs to be granted permission to alter
   the wal prohibited state.
3. Verify that a session with an open write transaction is disconnected
   when the system state is changed to wal prohibited.
4. Verify that the WAL write and checkpoint LSNs do not change across a
   restart of a wal prohibited system, and that the wal prohibited state
   is preserved.
5. On a wal prohibited system the shutdown checkpoint and the
   end-of-recovery checkpoint at startup are skipped; verify that an
   implicit checkpoint is performed once the state changes back to wal
   permitted.
6. A standby server cannot be wal prohibited; standby.signal and/or
   recovery.signal take the system out of the wal prohibited state.
7. A session whose transaction has performed a write but not yet
   committed is terminated when the state changes to wal prohibited.
8. Change the 026_overwrite_contrecord.pl test to run against a wal
   prohibited system.  (XXX: perhaps we should make a copy of this file
   for wal prohibited testing; I think that is not needed.)
---
 .../recovery/t/026_overwrite_contrecord.pl    |  11 +-
 src/test/recovery/t/027_pg_prohibit_wal.pl    | 216 ++++++++++++++++++
 2 files changed, 223 insertions(+), 4 deletions(-)
 create mode 100644 src/test/recovery/t/027_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl b/src/test/recovery/t/026_overwrite_contrecord.pl
index b78c2fd7912..2dfb1d22809 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -65,10 +65,11 @@ my $endfile = $node->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_insert_lsn())');
 ok($initfile ne $endfile, "$initfile differs from $endfile");
 
-# Now stop abruptly, to avoid a stop checkpoint.  We can remove the tail file
-# afterwards, and on startup the large message should be overwritten with new
-# contents
-$node->stop('immediate');
+# Change system to wal prohibited that will skip shutdown checkpoint.  We can
+# remove the tail file afterwards, and on startup the large message should be
+# overwritten with new contents
+$node->safe_psql('postgres', qq{SELECT pg_prohibit_wal(true)});
+$node->stop;
 
 unlink $node->basedir . "/pgdata/pg_wal/$endfile"
   or die "could not unlink " . $node->basedir . "/pgdata/pg_wal/$endfile: $!";
@@ -81,6 +82,8 @@ $node_standby->init_from_backup($node, 'backup', has_streaming => 1);
 $node_standby->start;
 $node->start;
 
+# Change system to wal permitted now.
+$node->safe_psql('postgres', qq{SELECT pg_prohibit_wal(false)});
 $node->safe_psql('postgres',
 	qq{create table foo (a text); insert into foo values ('hello')});
 $node->safe_psql('postgres',
diff --git a/src/test/recovery/t/027_pg_prohibit_wal.pl b/src/test/recovery/t/027_pg_prohibit_wal.pl
new file mode 100644
index 00000000000..b426cfc9aa0
--- /dev/null
+++ b/src/test/recovery/t/027_pg_prohibit_wal.pl
@@ -0,0 +1,216 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test wal prohibited state.
+use strict;
+use warnings;
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 22;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize database node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Create few tables and insert some data
+$node_primary->safe_psql('postgres',  <<EOSQL);
+CREATE TABLE tab AS SELECT 1 AS i;
+CREATE UNLOGGED TABLE unlogtab AS SELECT 1 AS i;
+EOSQL
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is now wal prohibited');
+
+#
+# In wal prohibited state, further table insert will fail.
+#
+# Note that even though an insert into an unlogged or temporary table doesn't
+# generate WAL itself, the transaction will still acquire a transaction ID,
+# which is not allowed on a wal prohibited system; and its commit or abort
+# record would have to be WAL-logged at the end, which is prohibited as well.
+#
+my ($stdout, $stderr, $timed_out);
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(2)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, table insert is failed');
+$node_primary->psql('postgres', 'INSERT INTO unlogtab VALUES(2)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, unlogged table insert is failed');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on server, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint, last checkpoint lsn : $checkpoint_lsn");
+
+# Change server to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'server is change to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+is($new_checkpoint_lsn ne $checkpoint_lsn, 1,
+	"new checkpoint performed, new checkpoint lsn : $new_checkpoint_lsn");
+
+my $new_write_lsn = $node_primary->lsn('write');
+is($new_write_lsn ne $write_lsn, 1,
+	"new wal writes on server, new latest wal write lsn : $new_write_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(2)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '2',
+	'table insert passed');
+
+# Only the superuser and the user who granted permission able to call
+# pg_prohibit_wal to change wal prohibited state.
+$node_primary->safe_psql('postgres', 'CREATE USER non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+like($stderr, qr/permission denied for function pg_prohibit_wal/,
+	'permission denied to non-superuser for alter wal prohibited state');
+$node_primary->safe_psql('postgres', 'GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'granted permission to non-superuser, able to alter wal prohibited state');
+
+# back to normal state
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(false)');
+
+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$mysession_stdin,
+	'>',
+	\$mysession_stdout,
+	'2>',
+	\$mysession_stderr,
+	$psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(4);
+SELECT $$value-4-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-4-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-4-inserted-into-tab/,
+	'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+	'session with open write transaction is terminated');
+
+# Now stop the primary server in WAL prohibited state and take filesystem level
+# backup and set up new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server is stopped in wal prohibited state, the filesystem level
+# copy also be in wal prohibited state
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set the new server as standby of primary.
+# enable_streaming will create standby.signal file which will take out system
+# from wal prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check if the new server has been taken out from the wal prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# Recovery server cannot be put into wal prohibited state.
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# Primary is still in wal prohibited state, the further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(3)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'primary server is wal prohibited, table insert is failed');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'primary server is change to wal permitted');
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(3)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '3',
+	'insert passed on primary');
+
+# Wait for standbys to catch up
+$node_primary->wait_for_catchup($node_standby, 'write');
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '3',
+	'new insert replicated on standby as well');
+
+
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location in control file found\n"
+	unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0

v43-0002-Remove-dependencies-on-startup-process-specifica.patch (application/x-patch)
From 55e99daad99ec2f4fae907938b5cf38a2afe11dd Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v43 2/6] Remove dependencies on startup-process specific
 variables.

To make XLogAcceptWrites() callable from places other than the startup
process, remove its dependency on a few global and local variables that
are specific to the startup process.

The global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be read from any process through the
existing PromoteIsTriggered().  abortedRecPtr and
ArchiveRecoveryRequested are made accessible by copying them into shared
memory.  missingContrecPtr can be derived from the EndOfLog value kept in
shared memory.

XLogAcceptWrites() used to accept two arguments, EndOfLogTLI and
EndOfLog, which are local to StartupXLOG().  Both of these are also
exported into shared memory, since none of the existing shared memory
variables matches these values exactly.

Also, make sure to use a volatile pointer to access XLogCtl to read
the latest shared variable values.
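
In short, the startup process publishes the needed values into XLogCtl, and
XLogAcceptWrites() later reads them back through a volatile pointer, possibly
from a different process.  Condensed from the diff below (error handling and
unrelated code omitted):

    /* In StartupXLOG(), while the values are still local to that process: */
    XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
    XLogCtl->endOfLog = EndOfLog;
    XLogCtl->endOfLogTLI = EndOfLogTLI;

    /* Later, in XLogAcceptWrites(), wherever it ends up running: */
    volatile XLogCtlData *xlogctl = XLogCtl;

    if (!XLogRecPtrIsInvalid(xlogctl->SharedAbortedRecPtr))
    {
        /* Restore the startup-local values and write the overwrite record */
        abortedRecPtr = xlogctl->SharedAbortedRecPtr;
        missingContrecPtr = xlogctl->endOfLog;
        CreateOverwriteContrecordRecord(abortedRecPtr);
        xlogctl->SharedAbortedRecPtr = InvalidXLogRecPtr;
    }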
---
 src/backend/access/transam/xlog.c | 107 +++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e523eb40b9a..e6fed15516c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -660,6 +660,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag, which is otherwise valid only in the
+	 * startup process, so that other processes can read it.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -709,6 +716,21 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr so that another process can
+	 * write the OVERWRITE_CONTRECORD message when WAL writes are not
+	 * permitted in the process that set it.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
+	/*
+	 * Endpoint of the WAL that we consider valid at server startup.  These
+	 * values are invalid during recovery and do not change once they have
+	 * been set.
+	 */
+	XLogRecPtr	endOfLog;
+	TimeLineID	endOfLogTLI;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -882,9 +904,7 @@ static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog,
 								TimeLineID newTLI);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog,
-										TimeLineID newTLI);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -939,8 +959,7 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							 TimeLineID newTLI);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -5599,6 +5618,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5789,9 +5813,17 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog, TimeLineID newTLI)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							TimeLineID newTLI)
+CleanupAfterArchiveRecovery(void)
 {
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	XLogRecPtr	EndOfLog = xlogctl->endOfLog;
+	TimeLineID	EndOfLogTLI = xlogctl->endOfLogTLI;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5809,7 +5841,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
-	RemoveNonParentXlogFiles(EndOfLog, newTLI);
+	RemoveNonParentXlogFiles(EndOfLog, xlogctl->InsertTimeLineID);
 
 	/*
 	 * If the switch happened in the middle of a segment, what to do with the
@@ -8038,6 +8070,16 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory.  This process
+		 * might be unable to write an OVERWRITE_CONTRECORD message because of
+		 * the WAL write restriction.  Storing it in shared memory lets another
+		 * process write it as soon as WAL writing is enabled again.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8106,6 +8148,13 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
+	/*
+	 * Store EndOfLog and EndOfLogTLI into shared memory to share with other
+	 * processes.
+	 */
+	XLogCtl->endOfLog = EndOfLog;
+	XLogCtl->endOfLogTLI = EndOfLogTLI;
+
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
@@ -8136,8 +8185,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory now; once WAL writes are
+	 * permitted, an XLOG_FPW_CHANGE record is written before the resource
+	 * managers write cleanup WAL records or a checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, newTLI);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8197,30 +8253,35 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-				 TimeLineID newTLI)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(xlogctl->SharedAbortedRecPtr))
 	{
+		/* Restore values */
+		abortedRecPtr = xlogctl->SharedAbortedRecPtr;
+		missingContrecPtr = xlogctl->endOfLog;
+
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
 		CreateOverwriteContrecordRecord(abortedRecPtr);
+
+		xlogctl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 		abortedRecPtr = InvalidXLogRecPtr;
 		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
@@ -8232,7 +8293,7 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 	 * need a lock to access this, since this can't change any more by the time
 	 * we reach this code.
 	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+	if (!XLogRecPtrIsInvalid(xlogctl->lastReplayedEndRecPtr))
 		promoted = PerformRecoveryXLogAction();
 
 	/*
@@ -8243,7 +8304,7 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8378,8 +8439,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0

v43-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/x-patch)
From 004601c12b8c1deccea057043074713ecc897878 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v43 4/6] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR check before critical sections that write WAL,
fired when the system is WAL prohibited.  Which of the two is used is chosen
by the following criteria:

 - Use an ERROR in functions that can be reached without a valid XID, e.g.
   from VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - Use an Assert in functions that cannot be reached without a valid XID;
   the assertion also verifies XID validity.  For that,
   AssertWALPermitted_HaveXID() is added.

To enforce the rule that one of the aforesaid checks precedes a critical
section that writes WAL, a new assert-only flag, walpermit_checked_state, is
added.  If the check is missing, XLogBeginInsert() asserts when called inside
a critical section.

If the WAL insert is not done inside a critical section, the above check is
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
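
For illustration, a typical call site follows the pattern below.  This is a
simplified sketch modeled on the contrib/pg_surgery hunk in this patch; the
function name is made up and the page modification itself is elided:

    static void
    force_log_page(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /*
         * Reachable without a valid XID (e.g. from maintenance code), so use
         * the ERROR variant rather than the Assert one.
         */
        if (needwal)
            CheckWALPermitted();

        /* No ereport(ERROR) from here until the changes are logged */
        START_CRIT_SECTION();

        /* ... modify the page here ... */
        MarkBufferDirty(buf);

        /* XLOG stuff */
        if (needwal)
            log_newpage_buffer(buf, true);

        END_CRIT_SECTION();
    }

CheckWALPermitted() and the Assert variants come from the new
access/walprohibit.h header added by this patch.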
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 +++++--
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++---
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 +++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 ++++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++---
 src/backend/access/heap/visibilitymap.c   | 19 ++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 24 ++++++++++---
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 ++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 30 +++++++++++-----
 src/backend/access/transam/xloginsert.c   | 21 +++++++++--
 src/backend/commands/sequence.c           | 16 +++++++++
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 44 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 ++++++++++++++
 39 files changed, 516 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..a3718246588 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index fbccf3d038d..e252b2c22a8 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
 			computeLeafRecompressWALData(leaf);
+			CheckWALPermitted();
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..76630b12490 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..5c7b5fc9e9d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6d2d71be32b..7b321c69880 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..e57e83c8c4d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index eb3810494f2..a47a3dd84cc 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index fe9f0df20b1..4ea7b1c934f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b312af57e11..197d226f2ec 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+		CheckWALPermitted();
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
 						XLogEnsureRecordSpace(0, 3 + nitups);
+						CheckWALPermitted();
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 159646c7c3e..d1989e93b35 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ec234a5e595..fe63d241f3b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3705,6 +3712,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3889,6 +3898,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4821,6 +4832,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5611,6 +5624,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5769,6 +5784,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5877,6 +5894,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5997,6 +6016,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6027,6 +6047,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6037,7 +6061,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 5c0b60319d8..2d08b58323c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write if WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -252,6 +253,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -306,6 +308,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -339,7 +345,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88b9d1f41c3..43f8f50420d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1326,6 +1327,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1341,8 +1347,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1947,8 +1952,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1973,7 +1983,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2400,6 +2410,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2410,6 +2421,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2440,7 +2454,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..6fb0c282486 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -272,12 +274,19 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
+		/*
+		 * We can reach here from VACUUM or from the startup process, so we
+		 * need not have a valid XID.
+		 */
+		if (needwal && XLogRecPtrIsInvalid(recptr))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +483,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +497,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +531,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index c88dc6eedbd..9ed8039d730 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -235,6 +236,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0fe8c709395..fb6d0a59055 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -1240,6 +1241,7 @@ _bt_insertonpg(Relation rel,
 		Page		metapg = NULL;
 		BTMetaPageData *metad = NULL;
 		BlockNumber blockcache;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		/*
 		 * If we are doing this insert because we split a page that was the
@@ -1265,6 +1267,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1303,7 +1308,7 @@ _bt_insertonpg(Relation rel,
 		}
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
@@ -1488,6 +1493,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	bool		newitemonleft,
 				isleaf,
 				isrightmost;
+	bool		needwal;
 
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
@@ -1915,13 +1921,18 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above for grabbing
+	 * the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -1958,7 +1969,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_split xlrec;
 		uint8		xlinfo;
@@ -2446,6 +2457,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
@@ -2483,6 +2495,10 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
@@ -2540,7 +2556,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_newroot xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5bc7c3616a9..b4fb0a63091 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index 70557bcf3d0..caafd1dd916 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -214,6 +215,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -458,6 +461,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1131,6 +1136,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1539,6 +1546,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1625,6 +1634,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1810,6 +1821,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 28b153abc3c..561b67bb712 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1164,6 +1165,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2265,6 +2268,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2363,6 +2369,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 92d2fa82dfe..8f7d394eaa9 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag that enforces the rule that WAL insert permission is
+ * checked before starting a critical section for WAL writes.  For this, one
+ * of CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 7a6afea9f3f..a4687cfffb7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1368,6 +1369,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1728,6 +1731,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only get here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b8d009160f3..61e5258bddc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1069,7 +1069,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*
 	 * Given that we're not in recovery, InsertTimeLineID is set and can't
@@ -2949,9 +2949,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, the WAL
+	 * prohibit state must not restrict WAL flushing; otherwise, dirty buffers
+	 * could not be evicted until WAL had been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9586,6 +9588,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are prohibited. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9752,6 +9757,9 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
+	/* We can never reach here with WAL prohibited; assert that */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9809,6 +9817,9 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn)
 	xlrec.overwritten_lsn = aborted_lsn;
 	xlrec.overwrite_time = GetCurrentTimestamp();
 
+	/* We can never reach here with WAL prohibited; assert that */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10456,7 +10467,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10470,10 +10481,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10495,8 +10506,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* We can never reach here with WAL prohibited; assert that */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 689384a411f..2b4b6040050 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -139,9 +140,20 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
-	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
+	/*
+	 * Cross-check on whether we should be here or not.
+	 *
+	 * This check mainly protects callers outside a critical section that did
+	 * not perform the WAL write permission check before reaching here.
+	 */
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -219,6 +231,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b5bf7e4efe9..089d8552c2e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -934,6 +934,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96a..045f3a48da3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3888,13 +3888,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index 2faed6c100f..66b55756ef3 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -57,4 +58,47 @@ GetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* The calling code is never reached when WAL is prohibited; assert that. */
+static inline void
+AssertWALPermitted(void)
+{
+	Assert(XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off when pg_prohibit_wal() is
+ * executed, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * Unlike the assertion above, a transaction that doesn't have a valid XID
+ * (e.g. VACUUM) won't be killed while the system state is being changed to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 #endif							/* WALPROHIBIT_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a30160657..b438ec31fc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset walprohibit_checked_state when no longer in a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0
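
To make the coding rule in the preceding patch easier to review, here is a
minimal sketch (not part of the patch) of the pattern the hunks above follow:
check WAL permission while ereport(ERROR) is still safe, i.e. before entering
the critical section that performs the WAL write.  The enclosing function and
its arguments are hypothetical; CheckWALPermitted(), RelationNeedsWAL(),
log_newpage_buffer(), and the critical-section macros are the real ones used
in the hunks (headers such as access/walprohibit.h and access/xloginsert.h
would be needed).

static void
hypothetical_overwrite_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Per the new rule: report the error here, outside the critical section */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	/* ... modify the page held in buf ... */
	MarkBufferDirty(buf);

	/* WAL write; permission has already been checked above */
	if (needwal)
		log_newpage_buffer(buf, true);

	END_CRIT_SECTION();
}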

v43-0005-Documentation.patchapplication/x-patch; name=v43-0005-Documentation.patchDownload
From 37d72603a985d4c91f86afc1af6f430ba4448557 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v43 5/6] Documentation.

---
 doc/src/sgml/func.sgml              | 20 ++++++++++
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 doc/src/sgml/monitoring.sgml        |  4 ++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 5 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0a725a67117..10469a8a41d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25420,6 +25420,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state of the system as specified by the boolean
+        argument, forcing all processes of the <productname>PostgreSQL</productname>
+        server to accept the state change immediately. When <literal>true</literal>
+        is passed, the system is put into the WAL prohibited state, where WAL writes
+        are disallowed, unless it is already in that state. When
+        <literal>false</literal> is passed, the system is returned to the WAL
+        permitted state, where WAL writes are allowed. See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c43f2140205..98b660941b1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state. Any user with the required
+    permission can call the <function>pg_prohibit_wal</function> function to
+    force the system into WAL prohibited mode, in which insertion of
+    write-ahead log records is prohibited until the same function is executed
+    to change the state back to read-write. As in Hot Standby, connections to
+    the server may still run read-only queries while the system is in the WAL
+    prohibited state. If the system is in the WAL prohibited state, the GUC
+    <literal>wal_prohibited</literal> reports <literal>on</literal>;
+    otherwise, it reports <literal>off</literal>.  When the WAL prohibited
+    state is requested, any existing session that is running a transaction
+    which has already performed, or may still perform, WAL writes is
+    terminated. This is useful for HA setups where the master server needs to
+    stop accepting WAL writes immediately and kick out any transaction
+    expecting to write WAL at the end, for example when the network on the
+    master is down or replication connections have failed.
+   </para>
+
+   <para>
+    Shutting down a system that is in the WAL prohibited state skips the
+    shutdown checkpoint; at restart, the server performs crash recovery and
+    remains WAL prohibited until the state is changed back to read-write.  If
+    a server starting in the WAL prohibited state finds a
+    <filename>standby.signal</filename> or <filename>recovery.signal</filename>
+    file, it implicitly leaves the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index af6914872b1..4738607e167 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1549,6 +1549,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>SystemWALProhibitStateChange</literal></entry>
+      <entry>Waiting for the system WAL prohibit state to change.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it was forced into WAL prohibition by executing the pg_prohibit_wal()
+function.  We have a lower-level defense in XLogBeginInsert() and elsewhere to
+stop us from modifying data when !XLogInsertAllowed(), but if XLogBeginInsert()
+is called inside a critical section we must not depend on it to report an
+error; any error raised there escalates to PANIC, as mentioned previously.
+
+During recovery we never reach a point where we try to write WAL, but
+pg_prohibit_wal() can be executed at any time by a user to stop WAL writing.
+Any backend that receives the WAL prohibit state transition barrier interrupt
+must stop writing WAL immediately.  While absorbing the barrier, a backend
+kills its running transaction if it has a valid XID, since a valid XID
+indicates that the transaction has performed, or is planning, WAL writes.
+Transactions that have not yet acquired an XID, and operations such as VACUUM
+or CREATE INDEX CONCURRENTLY that do not necessarily have an XID when writing
+WAL, are not stopped during barrier processing; they may instead hit the error
+from XLogBeginInsert() when trying to write WAL in the WAL prohibited state.
+To prevent that error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, an assert-only flag records whether permission has
+been checked before XLogBeginInsert() is called.  If it has not been,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is not inside a critical section, where
+throwing an error is acceptable.  To set the flag, one of CheckWALPermitted(),
+AssertWALPermittedHaveXID(), or AssertWALPermitted() should be called before
+START_CRIT_SECTION().  The flag is reset automatically on exit from the
+critical section.  The rules for choosing among the permission check routines
+are:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERT or UPDATE is expected, which never happens without a
+	valid XID, can be checked using AssertWALPermittedHaveXID(), so that
+	non-assert builds do not pay the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, which may
+	or may not have an XID, but where we still want assert-enabled builds to
+	verify that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read-only (i.e.
+during recovery or in the WAL prohibited state), so in those states we simply
+skip dirtying blocks because of hints.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
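
For anyone trying out the behaviour documented above, a minimal usage sketch
(the table t is just an example; output and error text are abbreviated):

-- Force the system into the WAL prohibited (read-only) state.
SELECT pg_prohibit_wal(true);

-- The reported GUC reflects the state; read-only queries still work.
SHOW wal_prohibited;             -- on
SELECT count(*) FROM pg_class;   -- works as usual

-- New transactions start read-only while WAL is prohibited.
INSERT INTO t VALUES (1);        -- fails: read-only / WAL is prohibited

-- Return to normal read-write operation.
SELECT pg_prohibit_wal(false);
SHOW wal_prohibited;             -- off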

v43-0003-Implement-wal-prohibit-state-using-global-barrie.patchapplication/x-patch; name=v43-0003-Implement-wal-prohibit-state-using-global-barrie.patchDownload
From 81bbadd8cfc7a8ef281e5ad90736505bc1f77e93 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v43 3/6] Implement wal prohibit state using global barrier.

Implementation:

 1. A user requests the WAL-prohibited state by calling the
    pg_prohibit_wal(true) SQL function; the backend marks the state
    transition as in progress in shared memory and signals the
    checkpointer process.  The checkpointer notices the pending state
    transition, emits the barrier request, and then acknowledges back
    to the backend that requested the state change once the transition
    has been completed.  The final state is recorded in the control file
    to make it persistent across system restarts.

 2. When a backend receives the WAL-prohibited barrier while it is already
    in a transaction that has been assigned an XID, the backend is killed
    by throwing FATAL.  (XXX: needs more discussion on this.)

 3. Otherwise, if the backend's transaction has no valid XID, nothing special
    needs to be done right now; we simply call ResetLocalXLogInsertAllowed()
    so that any future WAL insert will check XLogInsertAllowed() first, which
    reflects the WAL prohibited state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher and the checkpointer do nothing in the
    WAL-prohibited state until someone wakes them up, e.g. a backend that
    later requests putting the system back into the state where WAL is no
    longer prohibited.

 6. Shutdown in WAL-prohibited mode skips the shutdown checkpoint and xlog
    rotation. Starting up again performs crash recovery, but the
    end-of-recovery checkpoint and the WAL writes needed to start the server
    normally are skipped; they are performed once the system is changed back
    so that WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is not allowed on a standby server.

 8. The presence of a standby.signal and/or recovery.signal file implicitly
    and permanently takes the server out of the WAL prohibited state.

 9. Add a wal_prohibited GUC to show the system state -- it is "on" when the
    system is WAL prohibited.
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 490 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 191 ++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   8 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  27 ++
 src/backend/storage/ipc/ipci.c           |   7 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  31 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  60 +++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   3 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 882 insertions(+), 75 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..92d2fa82dfe
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,490 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool WALProhibitStateChangeIsInProgress;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take appropriate action when the system-wide WAL
+ *	prohibit state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transaction that has an XID *before* allowing the system
+	 * to go into the WAL prohibited state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should get here only while transitioning towards the WAL
+		 * prohibited state.
+		 */
+		Assert(GetWALProhibitState(GetWALProhibitCounter()) ==
+			   WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing only the transaction by throwing ERROR, for the following
+		 * reasons that still need thought:
+		 *
+		 * 1. Because of challenges with the wire protocol, we cannot simply
+		 * kill an idle transaction.
+		 *
+		 * 2. If we are in a subtransaction then ERROR would kill only the
+		 * current subtransaction.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ *
+ *	NB: The function always returns true; this leaves room for future code
+ *	changes that might need to return false for some reason.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+	bool		increment;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	wal_prohibit_counter = GetWALProhibitCounter();
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState(wal_prohibit_counter))
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			increment = true;
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again after sometime.")));
+			increment = false;
+			break;
+	}
+
+	if (increment)
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state = GetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		CompleteWALProhibitChange();
+		PG_RETURN_BOOL(true);
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL prohibit
+	 * state to all backends.  The checkpointer will do that and then update the
+	 * shared-memory WAL prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_BOOL(true);		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the wal prohibit state counter reaches to target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Any state other than read-write is considered read-only */
+	return (GetWALProhibitState(GetWALProhibitCounter()) !=
+			WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete WAL prohibit state transition.
+ *
 *	Depending on the final state being transitioned to, the in-memory state
+ *	update is done either before or after emitting the global barrier.
+ *
+ *	The idea behind this is that once we say the system is WAL prohibited,
+ *	WAL writes must be prohibited in all backends; but when the system is no
+ *	longer WAL prohibited, it is not necessary to take every backend out of
+ *	the WAL prohibited state at once.  There is no harm in letting those
+ *	backends run read-only a little longer, until we emit the barrier, since
+ *	they might have connected while the system was WAL prohibited and might be
+ *	doing read-only work anyway.  Backends that connect from now on can start
+ *	read-write operations immediately.
+ *
+ *	Therefore, while moving the system out of WAL prohibition, we update the
+ *	system state immediately and emit the barrier later.  But while moving the
+ *	system into WAL prohibition, we emit the global barrier first, to ensure
+ *	that no backend writes WAL before we set the system state to WAL
+ *	prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state counter */
+	uint32		wal_prohibit_counter = GetWALProhibitCounter();
+	WALProhibitState cur_state = GetWALProhibitState(wal_prohibit_counter);
+
+	/*
+	 * Must be called by Checkpointer.  Otherwise, it must be single-user
+	 * backend.
+	 */
+	Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
+
+	/* Should be here only in transition state */
+	Assert(cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY ||
+		   cur_state == WALPROHIBIT_STATE_GOING_READ_WRITE);
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has started, it must be completed.
+	 * If the server crashes before completion, the control file information
+	 * is used to set the final WAL prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* When leaving the WAL prohibited state, update the state right away. */
+	if (!wal_prohibited)
+	{
+		/* The operation to allow wal writes should be done by now  */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * The counter should now reflect the final state where WAL is no
+		 * longer prohibited.
+		 */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * The WAL prohibit state change has been initiated.  We need to complete
+	 * the transition by setting the requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * No need to be too aggressive about flushing XLOG data right away, since
+	 * XLogFlush is not restricted in the WAL prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the WAL prohibit state counter in shared memory once the
+	 * barrier has been processed by every backend, which ensures that all
+	 * backends are in the WAL prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		/*
+		 * No other process performs the final state transition, so the shared
+		 * WAL prohibit state counter should not have been changed in the
+		 * meantime.
+		 */
+		Assert(GetWALProhibitCounter() == wal_prohibit_counter);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(GetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up all backends waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ResetWALProhibitStateChangeFlag()
+ */
+void
+ResetWALProhibitStateChangeFlag(void)
+{
+	WALProhibitStateChangeIsInProgress = false;
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	WALProhibitState cur_state;
+
+	/*
+	 * Must be called by the checkpointer process, which has to make sure it
+	 * processes all pending WAL prohibit state change requests as soon as
+	 * possible.  Since CreateCheckPoint and ProcessSyncRequests sometimes run
+	 * in non-checkpointer processes, do nothing if this is not the
+	 * checkpointer.
+	 */
+	if (!AmCheckpointerProcess())
+		return;
+
+	/*
+	 * Quick exit if a state transition is already in progress, to avoid a
+	 * recursive call to process the WAL prohibit state transition in some
+	 * cases, e.g. the end-of-recovery checkpoint.
+	 */
+	if (WALProhibitStateChangeIsInProgress)
+		return;
+
+	WALProhibitStateChangeIsInProgress = true;
+
+	do
+	{
+		/* Get the latest state */
+		cur_state = GetWALProhibitState(GetWALProhibitCounter());
+
+		switch (cur_state)
+		{
+			case WALPROHIBIT_STATE_GOING_READ_WRITE:
+
+				/*
+				 * If the server was started in the WAL prohibited state, the
+				 * WAL writes that the startup process would normally perform
+				 * to start the server have been skipped; if so, perform them
+				 * right away.
+				 */
+				if (GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+					PerformPendingXLogAcceptWrites();
+
+				/* fall through */
+
+			case WALPROHIBIT_STATE_GOING_READ_ONLY:
+				CompleteWALProhibitChange();
+				break;
+
+			case WALPROHIBIT_STATE_READ_ONLY:
+				{
+					int			rc;
+
+					/*
+					 * Don't let the checkpointer process do anything until
+					 * someone wakes it up.  For example, a backend might
+					 * later request that the system be put back into the
+					 * read-write state.
+					 */
+					rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
+								   -1, WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+
+					/*
+					 * If the postmaster dies or a shutdown request is
+					 * received, just bail out.
+					 */
+					if (rc & WL_POSTMASTER_DEATH || ShutdownRequestPending)
+						return;
+				}
+				break;
+
+			case WALPROHIBIT_STATE_READ_WRITE:
+				break;			/* Done */
+		}
+	} while (cur_state != WALPROHIBIT_STATE_READ_WRITE);
+
+	WALProhibitStateChangeIsInProgress = false;
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8e35c432f5c..7a6afea9f3f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2013,23 +2013,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e6fed15516c..b8d009160f3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -237,9 +238,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -746,6 +748,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates whether the end-of-recovery
+	 * checkpoint and the WAL writes required to start the server normally
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5064,6 +5072,17 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5340,6 +5359,7 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6504,6 +6524,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch the latest state of XLogAcceptWrites() execution.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6876,13 +6905,30 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a WAL
+		 * prohibited state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8192,8 +8238,29 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 
-	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites();
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts will be
+	 * allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * a WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		promoted = XLogAcceptWrites();
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8263,6 +8330,20 @@ XLogAcceptWrites(void)
 	 */
 	volatile XLogCtlData *xlogctl = XLogCtl;
 
+	/*
+	 * If the WAL writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in a WAL prohibited state, only the checkpointer
+	 * process should reach here, to complete the work that was skipped
+	 * earlier while booting the system in that state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
+
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
@@ -8312,9 +8393,39 @@ XLogAcceptWrites(void)
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/*
+ * Wrapper for the checkpointer process to perform a pending XLogAcceptWrites().
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	/* Prepare to accept WAL writes. */
+	(void) XLogAcceptWrites();
+
+	/*
+	 * We need to update DBState explicitly, as the startup process does,
+	 * because the end-of-recovery checkpoint would set the database state
+	 * to a shutdown state.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	ControlFile->time = (pg_time_t) time(NULL);
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8455,6 +8566,11 @@ PerformRecoveryXLogAction(void)
 		 */
 		CreateEndOfRecoveryRecord();
 	}
+	else if (AmCheckpointerProcess())
+	{
+		/* In checkpointer process, just do it ourselves */
+		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+	}
 	else
 	{
 		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
@@ -8583,9 +8699,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8604,9 +8720,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8633,6 +8760,12 @@ LocalSetXLogInsertAllowed(void)
 	return oldXLogAllowed;
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8934,9 +9067,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * Perform a restartpoint if we are in recovery; otherwise, perform the
+	 * shutdown checkpoint and xlog rotation only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8949,6 +9086,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -9199,8 +9339,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Prepare to accumulate statistics.
@@ -9593,10 +9738,12 @@ CreateEndOfRecoveryRecord(void)
 {
 	xl_end_of_recovery xlrec;
 	XLogRecPtr	recptr;
+	XLogAcceptWritesState state = GetXLogWriteAllowedState();
 
 	/* sanity check */
-	if (!RecoveryInProgress())
-		elog(ERROR, "can only be used to end recovery");
+	if (state != XLOG_ACCEPT_WRITES_PENDING &&
+		state != XLOG_ACCEPT_WRITES_SKIPPED)
+		elog(ERROR, "can only be used at enabling WAL writes");
 
 	xlrec.end_time = GetCurrentTimestamp();
 
@@ -9652,10 +9799,12 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn)
 {
 	xl_overwrite_contrecord xlrec;
 	XLogRecPtr	recptr;
+	XLogAcceptWritesState state = GetXLogWriteAllowedState();
 
 	/* sanity check */
-	if (!RecoveryInProgress())
-		elog(ERROR, "can only be used at end of recovery");
+	if (state != XLOG_ACCEPT_WRITES_PENDING &&
+		state != XLOG_ACCEPT_WRITES_SKIPPED)
+		elog(ERROR, "can only be used at enabling WAL writes");
 
 	xlrec.overwritten_lsn = aborted_lsn;
 	xlrec.overwrite_time = GetCurrentTimestamp();
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index f6789025a5f..cc6d81cfff2 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -707,6 +707,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_logicalmapdir() FROM PUBLIC;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_replslotdir(text) FROM PUBLIC;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a9223e7b108..824e762b684 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,12 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, WAL writes must be permitted.  Second, we
+		 * need to make sure that there is a worker slot available.  Third, we
+		 * need to make sure that no other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..b5bf7e4efe9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -281,6 +282,12 @@ CheckpointerMain(void)
 			ckpt_active = false;
 		}
 
+		/*
+		 * Reset the WALProhibitState change status flag too, so that the state
+		 * change will be restarted if needed.
+		 */
+		ResetWALProhibitStateChangeFlag();
+
 		/*
 		 * Now return to normal top-level context and clear ErrorContext for
 		 * next time.
@@ -348,6 +355,7 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -692,6 +700,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1341,3 +1352,19 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows any process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9fa3e0631e6..9b391cb9cc2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -247,6 +248,12 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up the shared memory structure needed to handle concurrent WAL
+	 * prohibit state change requests.
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 6e69398cdda..762d5970f44 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c25af7fe090..b595c0db1bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56f..b27625f4845 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -241,9 +242,17 @@ SyncPostCheckpoint(void)
 		entry->canceled = true;
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
-		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop.
+		 * As in ProcessSyncRequests, we don't want to stop processing WAL
+		 * prohibit state change requests for a long time when there are many
+		 * deletions to be done.  They need to be checked and processed by the
+		 * checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a long
+		 * time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -302,6 +311,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* In the checkpointer, check for a WAL prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -360,6 +372,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * Don't stop processing WAL prohibit state change requests for a long
+		 * time when there are many fsync requests to process.  They need to be
+		 * checked and handled by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -446,6 +465,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Process WAL prohibit state change requests here too, for the
+				 * reasons mentioned above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d471..7bc3bd369b1 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4d53f040e81..60e5b985d00 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfda..67ea808c4b9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -235,6 +236,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -677,6 +679,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2119,6 +2122,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether the WAL is prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12572,4 +12587,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The strings returned here should match what _ShowOption() produces for
+ * a variable of boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..2faed6c100f
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,60 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ResetWALProhibitStateChangeFlag(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+GetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 898df2ee034..c8685142e56 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of XLogAcceptWrites() execution */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped XLogAcceptWrites() */
+	XLOG_ACCEPT_WRITES_DONE			/* done with XLogAcceptWrites() */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -279,6 +287,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -287,8 +296,10 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern uint64 GetSystemIdentifier(void);
 extern char *GetMockAuthenticationNonce(void);
 extern bool DataChecksumsEnabled(void);
@@ -299,6 +310,7 @@ extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 749bce0cc6f..19cf88d24ba 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -184,6 +184,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* If true, the system is WAL prohibited and WAL inserts are not allowed. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e934361dc32..d4b9c308863 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11671,6 +11671,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
{ oid => '4545', descr => 'change server to permit or prohibit WAL writes',
+  proname => 'pg_prohibit_wal', prorettype => 'bool',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 8785a8e12c1..22db80c81c1 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -225,7 +225,8 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index da6ac8ed83e..60622118874 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2834,6 +2834,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
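
For readers following the state-machine comment in walprohibit.h above: the
whole state lives in the low two bits of a counter that only ever advances by
one. Below is a minimal, standalone sketch of that encoding; the enum and the
masking mirror the patch, while the program around them is purely illustrative
and not part of the patch itself.

    #include <stdio.h>

    /* Mirrors the enum in walprohibit.h; the state is the counter's low two bits. */
    typedef enum
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
        WALPROHIBIT_STATE_READ_ONLY = 2,
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3
    } WALProhibitState;

    static WALProhibitState
    GetWALProhibitState(unsigned int wal_prohibit_counter)
    {
        /* Extract the low two bits, as the patch does. */
        return (WALProhibitState) (wal_prohibit_counter & 3);
    }

    int
    main(void)
    {
        static const char *const names[] = {
            "READ_WRITE", "GOING_READ_ONLY", "READ_ONLY", "GOING_READ_WRITE"
        };

        /* A counter starting at 0 (or 2, per the control file) cycles through the states. */
        for (unsigned int counter = 0; counter < 6; counter++)
            printf("counter=%u -> %s\n", counter, names[GetWALProhibitState(counter)]);
        return 0;
    }

Because the counter only ever advances by one, a second request for the same
target state does not cause a further change, as the walprohibit.h comment
above explains.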

v43-0001-Create-XLogAcceptWrites-function-with-code-from-.patch (application/x-patch)
From 6b7026398ddc905199953004f099527508b2a6be Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v43 1/6] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 112 +++++++++++++++++-------------
 1 file changed, 65 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b54ec549705..e523eb40b9a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -939,6 +939,8 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+							 TimeLineID newTLI);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -8134,53 +8136,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 *
-	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
-	 * entered recovery. Even if we ultimately replayed no WAL records, it will
-	 * have been initialized based on where replay was due to start.  We don't
-	 * need a lock to access this, since this can't change any more by the time
-	 * we reach this code.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, newTLI);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8236,6 +8193,67 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+				 TimeLineID newTLI)
+{
+	bool		promoted = false;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
+	 * entered recovery. Even if we ultimately replayed no WAL records, it will
+	 * have been initialized based on where replay was due to start.  We don't
+	 * need a lock to access this, since this can't change any more by the time
+	 * we reach this code.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0

#194 Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#193)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attaching the latest version, which has a few additional changes that decide
whether the checkpointer process should halt in the WAL prohibited state;
those changes are yet to be confirmed and tested thoroughly. Thanks.

Regards,
Amul

Attachments:

v44-0002-Remove-dependencies-on-startup-process-specifica.patch (application/x-patch)
From 6bb6149c5eca94de0b4315092f508090b309ea4b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v44 2/6] Remove dependencies on startup-process-specific
 variables.

To make XLogAcceptWrites() callable from processes other than the startup
process, we need to remove its dependency on a few global and local
variables that are specific to the startup process.

The global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be accessed from any other process
using the existing PromoteIsTriggered().  abortedRecPtr and
ArchiveRecoveryRequested are made accessible by copying them into shared
memory.  missingContrecPtr can be derived from the EndOfLog value in
shared memory.

XLogAcceptWrites() accepts two arguments, EndOfLogTLI and EndOfLog,
which are local to StartupXLOG().  Both of these are also exported into
shared memory, since none of the existing shared memory variables match
these values exactly.

Also, make sure to use a volatile pointer when accessing XLogCtl so that
the latest shared variable values are read.
---
 src/backend/access/transam/xlog.c | 107 +++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 210b982cc56..98af35b6cf3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -660,6 +660,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be share which is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -709,6 +716,21 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr so that, if WAL writes are not
+	 * permitted in the startup process, another process can write the
+	 * OVERWRITE_CONTRECORD message later.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
+	/*
+	 * endOfLog and endOfLogTLI mark the end of the portion of WAL that we
+	 * consider valid, as determined at server startup.  They are invalid
+	 * during recovery and do not change once set.
+	 */
+	XLogRecPtr	endOfLog;
+	TimeLineID	endOfLogTLI;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -882,9 +904,7 @@ static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog,
 								TimeLineID newTLI);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog,
-										TimeLineID newTLI);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -939,8 +959,7 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							 TimeLineID newTLI);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -5599,6 +5618,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -5789,9 +5813,17 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog, TimeLineID newTLI)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							TimeLineID newTLI)
+CleanupAfterArchiveRecovery(void)
 {
+	/*
+	 * Use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	XLogRecPtr	EndOfLog = xlogctl->endOfLog;
+	TimeLineID	EndOfLogTLI = xlogctl->endOfLogTLI;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -5809,7 +5841,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 	 * files containing garbage. In any case, they are not part of the new
 	 * timeline's history so we don't need them.
 	 */
-	RemoveNonParentXlogFiles(EndOfLog, newTLI);
+	RemoveNonParentXlogFiles(EndOfLog, xlogctl->InsertTimeLineID);
 
 	/*
 	 * If the switch happened in the middle of a segment, what to do with the
@@ -8038,6 +8070,16 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory.  This process
+		 * might be unable to write an OVERWRITE_CONTRECORD message because of
+		 * the WAL write restriction.  Storing it in shared memory lets another
+		 * process write it later, as soon as WAL writing is enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8106,6 +8148,13 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
+	/*
+	 * Store EndOfLog and EndOfLogTLI into shared memory to share with other
+	 * processes.
+	 */
+	XLogCtl->endOfLog = EndOfLog;
+	XLogCtl->endOfLogTLI = EndOfLogTLI;
+
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
@@ -8136,8 +8185,15 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	/*
+	 * Update full_page_writes in shared memory now; later, whenever WAL writes
+	 * are permitted, an XLOG_FPW_CHANGE record is written before the resource
+	 * managers write cleanup WAL records or the checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, newTLI);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8196,30 +8252,35 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-				 TimeLineID newTLI)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * Use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(xlogctl->SharedAbortedRecPtr))
 	{
+		/* Restore values */
+		abortedRecPtr = xlogctl->SharedAbortedRecPtr;
+		missingContrecPtr = xlogctl->endOfLog;
+
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
 		CreateOverwriteContrecordRecord(abortedRecPtr);
+
+		xlogctl->SharedAbortedRecPtr = InvalidXLogRecPtr;
 		abortedRecPtr = InvalidXLogRecPtr;
 		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
@@ -8231,7 +8292,7 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 	 * need a lock to access this, since this can't change any more by the time
 	 * we reach this code.
 	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+	if (!XLogRecPtrIsInvalid(xlogctl->lastReplayedEndRecPtr))
 		promoted = PerformRecoveryXLogAction();
 
 	/*
@@ -8242,7 +8303,7 @@ XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
 	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -8377,8 +8438,8 @@ PerformRecoveryXLogAction(void)
 	 * a full checkpoint. A checkpoint is requested later, after we're fully out
 	 * of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-		LocalPromoteIsTriggered)
+	if (XLogCtl->SharedArchiveRecoveryRequested && IsUnderPostmaster &&
+		PromoteIsTriggered())
 	{
 		promoted = true;
 
-- 
2.18.0
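
Condensed, the hand-off this patch sets up looks like the following sketch.
The field and function names are taken from the patch; the surrounding code is
simplified and omits locking and the other steps of XLogAcceptWrites().

    /* Startup process, in StartupXLOG(): publish values that used to be local. */
    XLogCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
    XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
    XLogCtl->endOfLog = EndOfLog;
    XLogCtl->endOfLogTLI = EndOfLogTLI;

    /* Any later caller of XLogAcceptWrites(), e.g. the checkpointer: */
    volatile XLogCtlData *xlogctl = XLogCtl;    /* force a fresh read of shared fields */

    if (!XLogRecPtrIsInvalid(xlogctl->SharedAbortedRecPtr))
    {
        abortedRecPtr = xlogctl->SharedAbortedRecPtr;   /* restore startup-local values */
        missingContrecPtr = xlogctl->endOfLog;
        CreateOverwriteContrecordRecord(abortedRecPtr);
    }

The volatile pointer follows the existing xlog.c convention for reading
XLogCtl fields that another process may have updated.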

v44-0005-Documentation.patch (application/x-patch)
From 556d0e362bd164c87b23f08c9d79a4c426c7b1d8 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v44 5/6] Documentation.

---
 doc/src/sgml/func.sgml              | 20 ++++++++++
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 doc/src/sgml/monitoring.sgml        |  4 ++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 5 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0a725a67117..10469a8a41d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25420,6 +25420,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Alters the WAL read-write state of the server and forces all
+        <productname>PostgreSQL</productname> server processes to accept that
+        state change immediately.  When <literal>true</literal> is passed, the
+        system is put into the WAL prohibited state, in which WAL writes are
+        not allowed, if it is not already in that state.  When
+        <literal>false</literal> is passed, the system is put into the WAL
+        permitted state, in which WAL writes are allowed, if it is not already
+        in that state.  See <xref linkend="wal-prohibited-state"/> for details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c43f2140205..98b660941b1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2339,4 +2339,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a WAL prohibited mode, in which inserting write-ahead log records is
+    prohibited until the same function is executed again to change the state
+    back to read-write.  As in Hot Standby, connections to the server are
+    allowed to run read-only queries while in the WAL prohibited state.  If the
+    system is in the WAL prohibited state, the GUC
+    <literal>wal_prohibited</literal> will show <literal>on</literal>;
+    otherwise, it will show <literal>off</literal>.  When the WAL prohibited
+    state is requested, any existing session that is running a transaction
+    which has already performed, or is expected to perform, WAL writes is
+    terminated.  This is useful for HA setups where the master server needs to
+    stop accepting WAL writes immediately and kick out any transaction
+    expecting to write WAL at the end, for example when the network is down on
+    the master or replication connections have failed.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint; at
+    restart, the system goes into crash recovery and stays in the WAL
+    prohibited state until it is changed back to read-write.  If a WAL
+    prohibited server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system
+    implicitly leaves the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index af6914872b1..4738607e167 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1549,6 +1549,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>SystemWALProhibitStateChange</literal></entry>
+      <entry>Waiting for the WAL prohibit state change to complete.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c12..24dca70a6cc 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+The WAL prohibited state is the system state in which it is not currently
+possible to insert write-ahead log records, either because the system is still
+in recovery or because it was forced read-only by executing the
+pg_prohibit_wal() function.  We have a lower-level defense in XLogBeginInsert()
+and elsewhere that stops us from modifying data when !XLogInsertAllowed(), but
+if XLogBeginInsert() is inside a critical section we must not depend on it to
+report an error; otherwise, it will cause a PANIC as mentioned previously.
+
+During recovery we never reach the point of trying to write WAL, but
+pg_prohibit_wal() can be executed by the user at any time to stop WAL writing.
+Any backend that receives the WAL prohibited state transition barrier interrupt
+must stop WAL writing immediately.  To absorb the barrier, a backend kills its
+running transaction if it has a valid XID, since a valid XID indicates that the
+transaction has performed, or is planning, a WAL write.  Transactions that have
+not yet acquired an XID, or operations such as VACUUM or CREATE INDEX
+CONCURRENTLY that do not necessarily have an XID when writing WAL, are not
+stopped during barrier processing, and they may hit an error from
+XLogBeginInsert() when trying to write WAL in the WAL prohibited state.  To
+prevent such an error from inside a critical section, WAL write permission has
+to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section for a WAL write, we have added an assertion flag that records whether
+permission was checked before calling XLogBeginInsert().  If it was not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory if XLogBeginInsert() is called outside a critical section, where
+throwing an error is acceptable.  To set the permission-check flag, either
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+should be called before START_CRIT_SECTION().  The flag is automatically reset
+when exiting the critical section.  The rules for choosing among the permission
+check routines are:
+
+	Places where a WAL write inside a critical section can happen without a
+	valid XID (e.g. vacuum) must be protected by CheckWALPermitted(), so that
+	the error can be reported before entering the critical section.
+
+	Places where INSERT and UPDATE are expected, which never happen without a
+	valid XID, can be checked using AssertWALPermittedHaveXID(), so that a
+	non-assert build does not have the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, and that
+	may or may not have an XID, but that still need to verify on assert-enabled
+	builds that permission was checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while read-only (i.e. during
+recovery or in the WAL prohibited state), so we simply skip dirtying blocks
+because of hints in those states.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59ad..15f0bb4b7b5 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibited
+state, so hint bits set in those states must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0

v44-0006-Test-Few-tap-tests-for-wal-prohibited-system.patchapplication/x-patch; name=v44-0006-Test-Few-tap-tests-for-wal-prohibited-system.patchDownload
From e43e9bc9d67117582ccafb402279c3ff65c9949b Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v44 6/6] Test: Few tap tests for wal prohibited system

Covers the following scenarios:

1. Basic verification, such as inserting into normal and unlogged tables
   on a WAL prohibited system.
2. Check that a non-superuser needs permission to alter the WAL
   prohibited state.
3. Verify that sessions with open write transactions are disconnected
   when the system state is changed to WAL prohibited.
4. Verify that the WAL write and checkpoint LSNs of a WAL prohibited
   system do not change across a restart, and neither does the state.
5. On restart of a WAL prohibited system the shutdown checkpoint and the
   recovery-end checkpoint are skipped; verify that an implicit
   checkpoint is performed when the state changes to WAL permitted.
6. A standby server cannot be WAL prohibited; standby.signal and/or
   recovery.signal take the system out of the WAL prohibited state.
7. A session whose transaction has performed a write but not yet
   committed is terminated when the state changes to WAL prohibited.
8. Change 026_overwrite_contrecord.pl to run against a WAL prohibited
   system. (XXX: should a copy of this file be made for WAL prohibited
   testing?  I think that's not needed.)
---
 .../recovery/t/026_overwrite_contrecord.pl    |  11 +-
 src/test/recovery/t/027_pg_prohibit_wal.pl    | 216 ++++++++++++++++++
 2 files changed, 223 insertions(+), 4 deletions(-)
 create mode 100644 src/test/recovery/t/027_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl b/src/test/recovery/t/026_overwrite_contrecord.pl
index b78c2fd7912..2dfb1d22809 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -65,10 +65,11 @@ my $endfile = $node->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_insert_lsn())');
 ok($initfile ne $endfile, "$initfile differs from $endfile");
 
-# Now stop abruptly, to avoid a stop checkpoint.  We can remove the tail file
-# afterwards, and on startup the large message should be overwritten with new
-# contents
-$node->stop('immediate');
+# Change the system to WAL prohibited, which will skip the shutdown checkpoint.
+# We can remove the tail file afterwards, and on startup the large message
+# should be overwritten with new contents
+$node->safe_psql('postgres', qq{SELECT pg_prohibit_wal(true)});
+$node->stop;
 
 unlink $node->basedir . "/pgdata/pg_wal/$endfile"
   or die "could not unlink " . $node->basedir . "/pgdata/pg_wal/$endfile: $!";
@@ -81,6 +82,8 @@ $node_standby->init_from_backup($node, 'backup', has_streaming => 1);
 $node_standby->start;
 $node->start;
 
+# Change the system to WAL permitted now.
+$node->safe_psql('postgres', qq{SELECT pg_prohibit_wal(false)});
 $node->safe_psql('postgres',
 	qq{create table foo (a text); insert into foo values ('hello')});
 $node->safe_psql('postgres',
diff --git a/src/test/recovery/t/027_pg_prohibit_wal.pl b/src/test/recovery/t/027_pg_prohibit_wal.pl
new file mode 100644
index 00000000000..c283646d755
--- /dev/null
+++ b/src/test/recovery/t/027_pg_prohibit_wal.pl
@@ -0,0 +1,216 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test wal prohibited state.
+use strict;
+use warnings;
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use IPC::Run ();
+use Test::More tests => 22;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize database node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Create a few tables and insert some data
+$node_primary->safe_psql('postgres',  <<EOSQL);
+CREATE TABLE tab AS SELECT 1 AS i;
+CREATE UNLOGGED TABLE unlogtab AS SELECT 1 AS i;
+EOSQL
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is now wal prohibited');
+
+#
+# In the WAL prohibited state, further table inserts will fail.
+#
+# Note that even though an insert into an unlogged or temporary table doesn't
+# generate WAL, the transaction doing that insert will acquire a transaction id,
+# which is not allowed on a WAL prohibited system.  Also, the transaction's
+# abort or commit record would be WAL-logged at the end, which is prohibited too.
+#
+my ($stdout, $stderr, $timed_out);
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(2)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
	'server is wal prohibited, table insert failed');
+$node_primary->psql('postgres', 'INSERT INTO unlogtab VALUES(2)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
	'server is wal prohibited, unlogged table insert failed');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on server, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint, last checkpoint lsn : $checkpoint_lsn");
+
+# Change server to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
	'off', 'server is changed to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+ok($new_checkpoint_lsn ne $checkpoint_lsn,
+	"new checkpoint performed, new checkpoint lsn : $new_checkpoint_lsn");
+
+my $new_write_lsn = $node_primary->lsn('write');
+ok($new_write_lsn ne $write_lsn,
+	"new wal writes on server, new latest wal write lsn : $new_write_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(2)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '2',
+	'table insert passed');
+
+# Only superusers and users who have been granted EXECUTE permission are able
+# to call pg_prohibit_wal() to change the WAL prohibited state.
+$node_primary->safe_psql('postgres', 'CREATE USER non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+like($stderr, qr/permission denied for function pg_prohibit_wal/,
+	'permission denied to non-superuser for alter wal prohibited state');
+$node_primary->safe_psql('postgres', 'GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'granted permission to non-superuser, able to alter wal prohibited state');
+
+# back to normal state
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(false)');
+
+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$mysession_stdin,
+	'>',
+	\$mysession_stdout,
+	'2>',
+	\$mysession_stderr,
+	$psql_timeout);
+
+# Perform a write in an open transaction
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(4);
+SELECT $$value-4-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-4-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-4-inserted-into-tab/,
+	'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+	'session with open write transaction is terminated');
+
+# Now stop the primary server in the WAL prohibited state, take a filesystem
+# level backup, and set up a new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server was stopped in the WAL prohibited state, so the filesystem
+# level copy is also in the WAL prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set up the new server as a standby of the primary.
+# enable_streaming will create the standby.signal file, which will take the
+# system out of the WAL prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check that the new server has been taken out of the WAL prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# A server in recovery cannot be put into the WAL prohibited state.
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# The primary is still in the WAL prohibited state, so a further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(3)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
	'primary server is wal prohibited, table insert failed');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
	'off', 'primary server is changed to wal permitted');
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(3)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '3',
+	'insert passed on primary');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby, 'write');
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '3',
+	'new insert replicated on standby as well');
+
+
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location in control file found\n"
+	unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0

v44-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchapplication/x-patch; name=v44-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patchDownload
From cff4dae41bced65157ef3a6a8c63147cbf728a4f Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v44 4/6] Error or Assert before START_CRIT_SECTION for WAL
 write

An Assert or an ERROR is added before critical sections that write WAL, based
on the following criteria:

 - Add an ERROR in functions that can be reached without a valid XID, e.g.
   during VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static
   inline function CheckWALPermitted() is added.
 - Add an Assert in functions that cannot be reached without a valid XID; the
   assertion also checks XID validity.  For that, AssertWALPermittedHaveXID()
   is added.

To enforce the rule that one of the aforesaid checks precedes a critical
section that writes WAL, a new assert-only flag walpermit_checked_state is
added.  If the check is missing, XLogBeginInsert() fails an assertion when it
is called inside a critical section.

If the WAL insert is not inside a critical section, the check before the
critical section is unnecessary; we can rely on XLogBeginInsert() itself to
perform the check and report an error.
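
As an illustration only (a sketch of the intended calling pattern, not code
copied verbatim from the patch; rel is a placeholder relation), the two cases
look like this:

    /* Reachable without a valid XID (e.g. VACUUM): report the error before
     * entering the critical section. */
    if (RelationNeedsWAL(rel))
        CheckWALPermitted();
    START_CRIT_SECTION();
    /* ... modify pages and insert WAL ... */
    END_CRIT_SECTION();

    /* DML paths that always hold a valid XID: assertion-only check, so
     * non-assert builds pay no overhead. */
    AssertWALPermittedHaveXID();
    START_CRIT_SECTION();
    /* ... modify pages and insert WAL ... */
    END_CRIT_SECTION();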
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 +++++--
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++---
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 +++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 ++++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++---
 src/backend/access/heap/visibilitymap.c   | 19 ++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 24 ++++++++++---
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 ++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 30 +++++++++++-----
 src/backend/access/transam/xloginsert.c   | 21 +++++++++--
 src/backend/commands/sequence.c           | 16 +++++++++
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 10 +++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 44 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 ++++++++++++++
 39 files changed, 516 insertions(+), 71 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 7edfe4f326f..f3108e0559a 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
 #include "miscadmin.h"
@@ -88,6 +89,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -99,6 +101,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -236,6 +239,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -316,12 +322,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959a..a3718246588 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -900,6 +901,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index df9ffc2fb86..270a881ab12 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index c574c8a06ef..76b81e65c53 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 482cf10877c..0d6dacfe84a 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index fbccf3d038d..e252b2c22a8 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
 			computeLeafRecompressWALData(leaf);
+			CheckWALPermitted();
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index e0d99409461..76630b12490 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0e8672c9e90..5c7b5fc9e9d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6d2d71be32b..7b321c69880 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -658,12 +659,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -683,7 +690,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a276eb020b5..049b5bde0a2 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c258..e57e83c8c4d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -135,6 +136,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -235,6 +239,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -469,8 +474,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -504,7 +512,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -530,6 +538,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -571,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1645,6 +1656,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1667,11 +1679,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1688,7 +1703,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531a..7af25731d3f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -272,6 +273,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -353,6 +355,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -360,7 +365,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -589,6 +594,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -643,6 +649,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -655,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index eb3810494f2..a47a3dd84cc 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
@@ -468,6 +469,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -574,6 +576,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -604,7 +610,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -691,6 +697,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -789,6 +796,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -810,7 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -884,6 +894,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -891,7 +904,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index fe9f0df20b1..4ea7b1c934f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
@@ -193,6 +194,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -360,6 +363,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -370,6 +374,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -393,7 +400,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b312af57e11..197d226f2ec 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -312,6 +313,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -510,6 +513,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -573,9 +577,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+		CheckWALPermitted();
+	}
 
 	START_CRIT_SECTION();
 
@@ -641,7 +650,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -922,14 +931,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
 						XLogEnsureRecordSpace(0, 3 + nitups);
+						CheckWALPermitted();
+					}
 
 					START_CRIT_SECTION();
 
@@ -947,7 +961,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 159646c7c3e..d1989e93b35 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -816,6 +817,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1172,6 +1175,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1223,6 +1228,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1269,6 +1276,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 29a4bf0c776..305b017204b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2103,6 +2104,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2387,6 +2390,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2947,6 +2952,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3687,6 +3694,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3870,6 +3879,8 @@ l2:
 										   bms_overlap(modified_attrs, id_attrs),
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4802,6 +4813,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5592,6 +5605,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5750,6 +5765,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5858,6 +5875,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5978,6 +5997,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6008,6 +6028,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6018,7 +6042,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 5c0b60319d8..2d08b58323c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
@@ -94,11 +95,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write if WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -252,6 +253,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -306,6 +308,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -339,7 +345,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ddd0bb98756..22135466a7e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -43,6 +43,7 @@
 #include "access/parallel.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -1327,6 +1328,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (!PageIsAllVisible(page))
 			{
+				bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+				if (needwal)
+					CheckWALPermitted();
+
 				START_CRIT_SECTION();
 
 				/* mark buffer dirty before writing a WAL record */
@@ -1342,8 +1348,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(vacrel->rel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
+				if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
@@ -1949,8 +1954,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1975,7 +1985,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2402,6 +2412,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2412,6 +2423,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; tupindex < dead_tuples->num_tuples; tupindex++)
@@ -2442,7 +2456,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd307..6fb0c282486 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -249,6 +250,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -272,12 +274,19 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
+		/*
+		 * We can reach here from VACUUM or from the startup process, so we need
+		 * not have a valid XID.
+		 */
+		if (needwal && XLogRecPtrIsInvalid(recptr))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -474,6 +483,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -487,8 +497,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -516,7 +531,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index c88dc6eedbd..9ed8039d730 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -235,6 +236,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 37ee0b4d6ee..b95f56ae97e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "common/pg_prng.h"
 #include "lib/qunique.h"
@@ -1241,6 +1242,7 @@ _bt_insertonpg(Relation rel,
 		Page		metapg = NULL;
 		BTMetaPageData *metad = NULL;
 		BlockNumber blockcache;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		/*
 		 * If we are doing this insert because we split a page that was the
@@ -1266,6 +1268,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1304,7 +1309,7 @@ _bt_insertonpg(Relation rel,
 		}
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
@@ -1489,6 +1494,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	bool		newitemonleft,
 				isleaf,
 				isrightmost;
+	bool		needwal;
 
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
@@ -1916,13 +1922,18 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -1959,7 +1970,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_split xlrec;
 		uint8		xlinfo;
@@ -2447,6 +2458,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
@@ -2484,6 +2496,10 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
@@ -2541,7 +2557,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_newroot xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5bc7c3616a9..b4fb0a63091 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1184,6 +1195,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1314,6 +1328,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2098,6 +2114,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2186,6 +2203,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2237,7 +2258,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2324,6 +2345,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2555,6 +2577,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2631,7 +2657,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index e7afb2c242a..aa262d9fa4c 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "common/pg_prng.h"
 #include "miscadmin.h"
@@ -215,6 +216,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -459,6 +462,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1132,6 +1137,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1540,6 +1547,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1626,6 +1635,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1811,6 +1822,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 76fb0374c42..b3f00f28b16 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc2..d0ae4ec1696 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1162,6 +1163,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2951,7 +2954,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 28b153abc3c..561b67bb712 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1164,6 +1165,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MyProc->delayChkpt = true;
@@ -2265,6 +2268,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2363,6 +2369,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e71bd1..58758737dd3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index 3234a1970a7..7acee404c7e 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert-only flag to enforce the rule that WAL insert permission is checked
+ * before starting a critical section for WAL writes.  For this, one of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Private state.
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2d9ad42eca7..a4c979081fe 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1369,6 +1370,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1729,6 +1732,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We can only reach here with a valid XID. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5db821a7fcd..a3949a7f37c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1069,7 +1069,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*
 	 * Given that we're not in recovery, InsertTimeLineID is set and can't
@@ -2949,9 +2949,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush. Also, WAL prohibit
+	 * state should not restrict WAL flushing. Otherwise, the dirty buffer
+	 * cannot be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -9581,6 +9583,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if wal writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -9744,6 +9749,9 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
+	/* Assured that WAL permission has been checked */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -9800,6 +9808,9 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn)
 	xlrec.overwritten_lsn = aborted_lsn;
 	xlrec.overwrite_time = GetCurrentTimestamp();
 
+	/* Assured that WAL permission has been checked */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -10445,7 +10456,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -10459,10 +10470,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -10484,8 +10495,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 689384a411f..2b4b6040050 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -139,9 +140,20 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
-	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section; otherwise, a WAL-prohibited error here would escalate to PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
+	/*
+	 * Cross-check on whether we should be here or not.
+	 *
+	 * This check mainly covers callers outside a critical section that did not
+	 * perform the WAL write permission check before reaching here.
+	 */
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -219,6 +231,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a49..d429b7bc02f 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -378,8 +379,13 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -766,8 +772,13 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * (Have to do that here, so we're outside the critical section)
 	 */
 	if (logit && RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -977,8 +988,13 @@ do_setval(Oid relid, int64 next, bool iscalled)
 
 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(seqrel))
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 74272dae69e..315fdc4a505 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -944,6 +944,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96a..045f3a48da3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3888,13 +3888,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or a system-wide one, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067d..65bfc0370e3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/freespace.h"
@@ -283,12 +284,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool		needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -303,7 +311,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce30..cb78dac718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -847,6 +848,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index d71522cbf3b..41bc221dbfd 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -49,6 +50,49 @@ CounterGetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reaches when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	Assert(XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by executing pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM) then it isn't killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 extern bool ProcessBarrierWALProhibit(void);
 extern void MarkCheckPointSkippedInWalProhibitState(void);
 extern void WALProhibitStateCounterInit(bool wal_prohibited);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a30160657..b438ec31fc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -106,6 +106,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset walprohibit_checked_state when no longer inside a critical section.
+ * Otherwise, mark it as checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -121,6 +145,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -150,6 +175,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

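To illustrate the coding rule that the preceding patch enforces, here is a
minimal sketch of a WAL-writing code path once the permission check is added
ahead of the critical section.  The function, relation, and buffer below are
hypothetical; only the CheckWALPermitted()/critical-section pattern is the
point:

    /* Hypothetical caller, shown only to illustrate the coding rule. */
    static void
    example_log_page(Relation rel, Buffer buf)
    {
        bool        needwal = RelationNeedsWAL(rel);

        /* Check WAL permission before entering the critical section. */
        if (needwal)
            CheckWALPermitted();    /* raises ERROR if WAL is prohibited */

        START_CRIT_SECTION();

        /* ... scribble on the page ... */
        MarkBufferDirty(buf);

        if (needwal)
            log_newpage_buffer(buf, false);

        END_CRIT_SECTION();
    }

Code paths that can only be reached with an XID already assigned use
AssertWALPermittedHaveXID() instead, since XID-bearing sessions are killed
before the system finishes entering the WAL prohibited state.
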
Attachment: v44-0003-Implement-wal-prohibit-state-using-global-barrie.patch (application/x-patch)
From 89c24c8f52829b8cf74e43384fb5d85cf5512f78 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v44 3/6] Implement wal prohibit state using global barrier.

Implementation:

 1. A user requests the WAL-Prohibited state by calling the SQL function
    pg_prohibit_wal(true).  The backend marks the state transition as in
    progress in shared memory and signals the checkpointer process.  The
    checkpointer notices the pending transition, emits the barrier request,
    and acknowledges back to the requesting backend once the transition has
    completed.  The final state is also recorded in the control file to make
    it persistent across restarts (see the state-encoding sketch after this
    list).

 2. When a backend absorbs the WAL-Prohibited barrier while it is already in
    a transaction that has been assigned an XID, the backend is killed by
    throwing FATAL (XXX: needs more discussion).

 3. Otherwise, if the backend's transaction has no valid XID, nothing special
    needs to happen right away; it simply calls ResetLocalXLogInsertAllowed()
    so that any future WAL insert checks XLogInsertAllowed() first, which
    reflects the WAL prohibited state appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher and the checkpointer do nothing while the server
    is in the WAL-Prohibited state until someone wakes them up, e.g. a
    backend that later requests putting the system back into a state where
    WAL is no longer prohibited.

 6. A shutdown in WAL-Prohibited mode skips the shutdown checkpoint and xlog
    rotation.  Starting up again performs crash recovery, but the
    end-of-recovery checkpoint and the other WAL writes needed to start the
    server normally are skipped; they are performed later, once the system is
    changed back to a state where WAL is no longer prohibited.

 7. Altering the WAL-Prohibited mode is not allowed on a standby server.

 8. The presence of a recovery.signal and/or standby.signal file implicitly
    and permanently takes the server out of the WAL prohibited state.

 9. Add a wal_prohibited GUC to show the system state -- it reports "on" when
    the system is WAL prohibited.
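
For illustration, the shared counter mentioned in (1) encodes the state
roughly as sketched below (the authoritative definitions are in the new
header src/include/access/walprohibit.h): the low two bits of an
ever-increasing counter give the current state, advancing the counter by one
enters the corresponding "GOING" state, and advancing it once more completes
the transition.

    /* Sketch only; the exact enum values live in walprohibit.h. */
    typedef enum WALProhibitState
    {
        WALPROHIBIT_STATE_READ_WRITE = 0,       /* WAL writes permitted */
        WALPROHIBIT_STATE_GOING_READ_ONLY = 1,  /* transition in progress */
        WALPROHIBIT_STATE_READ_ONLY = 2,        /* WAL writes prohibited */
        WALPROHIBIT_STATE_GOING_READ_WRITE = 3  /* transition in progress */
    } WALProhibitState;

    static inline WALProhibitState
    CounterGetWALProhibitState(uint32 wal_prohibit_counter)
    {
        return (WALProhibitState) (wal_prohibit_counter & 3);
    }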
---
 src/backend/access/transam/Makefile      |   1 +
 src/backend/access/transam/walprohibit.c | 433 +++++++++++++++++++++++
 src/backend/access/transam/xact.c        |  36 +-
 src/backend/access/transam/xlog.c        | 182 +++++++++-
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/variable.c          |   7 +
 src/backend/postmaster/autovacuum.c      |   8 +-
 src/backend/postmaster/bgwriter.c        |   2 +-
 src/backend/postmaster/checkpointer.c    |  37 ++
 src/backend/storage/ipc/ipci.c           |   7 +
 src/backend/storage/ipc/procsignal.c     |  24 +-
 src/backend/storage/lmgr/lock.c          |   6 +-
 src/backend/storage/sync/sync.c          |  31 +-
 src/backend/tcop/utility.c               |   1 +
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/misc/guc.c             |  27 ++
 src/bin/pg_controldata/pg_controldata.c  |   2 +
 src/include/access/walprohibit.h         |  60 ++++
 src/include/access/xlog.h                |  12 +
 src/include/catalog/pg_control.h         |   3 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/postmaster/bgwriter.h        |   2 +
 src/include/storage/procsignal.h         |   7 +-
 src/include/utils/wait_event.h           |   3 +-
 src/tools/pgindent/typedefs.list         |   1 +
 25 files changed, 828 insertions(+), 73 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de722..b5322a69954 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 00000000000..3234a1970a7
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,433 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Private state.
+ */
+static bool WALProhibitStateChangeIsInProgress;
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * The current WAL prohibit state counter; the last two bits of this
+	 * counter indicate the current wal prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static void CompleteWALProhibitChange(void);
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take an appropriate action when system wide WAL prohibit
+ *	state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to enter the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * Should be here only while transiting towards the WAL prohibit
+		 * state.
+		 */
+		Assert(GetWALProhibitState() == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing only the transaction by throwing ERROR, for the following
+		 * reasons that still need more thought:
+		 *
+		 * 1. Due to some present challenges with the wire protocol, we cannot
+		 * simply kill off an idle transaction.
+		 *
+		 * 2. If we are here in a subtransaction then the ERROR will kill the
+		 * current subtransaction only.  In the case of invalidations, that
+		 * might be good enough, but for XID assignment it's not, because
+		 * assigning an XID to a subtransaction also causes higher
+		 * sub-transaction levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ *
+ *	NB: The function currently always returns true; this leaves room for
+ *	future code changes that might need to return false for some reason.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState())
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again later.")));
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again later.")));
+			break;
+	}
+
+	wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state =
+			CounterGetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		ProcessWALProhibitStateChangeRequest();
+		PG_RETURN_BOOL(true);
+	}
+
+	/*
+	 * This is not the final state since we have yet to convey the WAL prohibit
+	 * state to all backends.  The checkpointer will do that and then update the
+	 * shared-memory wal prohibit state counter and the control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_BOOL(true);		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the wal prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Any state other than read-write is considered read-only */
+	return (GetWALProhibitState() != WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * CompleteWALProhibitChange()
+ *
+ *	Complete WAL prohibit state transition.
+ *
+ *	Depending on the final WAL prohibit state we are transitioning to, the
+ *	in-memory state update is done either before or after emitting the barrier.
+ *
+ *	The idea behind this is that when we say the system is WAL prohibited,
+ *	WAL writes must be prohibited in every backend, but when the system is
+ *	no longer WAL prohibited, it is not necessary to take every backend out
+ *	of the WAL prohibited state at once.  There is no harm in letting those
+ *	backends run as read-only a little longer, until we emit the barrier,
+ *	since they might have connected while the system was WAL prohibited and
+ *	might be doing read-only work.  Backends that connect from now on can
+ *	immediately start read-write operations.
+ *
+ *	Therefore, when moving the system out of the WAL prohibited state, we
+ *	update the system state immediately and emit the barrier later.  But when
+ *	moving the system into the WAL prohibited state, we emit the global
+ *	barrier first, to ensure that no backend writes WAL before we set the
+ *	system state to WAL prohibited.
+ */
+static void
+CompleteWALProhibitChange(void)
+{
+	uint64		barrier_gen;
+	bool		wal_prohibited;
+
+	/* Fetch shared wal prohibit state */
+	WALProhibitState cur_state = GetWALProhibitState();
+
+	/* Should be here only in transition state */
+	if (cur_state == WALPROHIBIT_STATE_READ_WRITE ||
+		cur_state == WALPROHIBIT_STATE_READ_ONLY)
+		return;
+
+	wal_prohibited = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * If the server was started in the wal prohibited state, the wal
+	 * writes that the startup process would normally perform to bring
+	 * the server up were skipped.  If that is the case, perform them
+	 * right away.
+	 */
+	if (!wal_prohibited &&
+		GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE)
+		PerformPendingXLogAcceptWrites();
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a wal prohibit state transition has been started it must be
+	 * completed.  If the server crashes before it completes, the control file
+	 * information is used to set the final wal prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(wal_prohibited);
+
+	/* Going out of WAL prohibited state then update state right away. */
+	if (!wal_prohibited)
+	{
+		uint32		wal_prohibit_counter PG_USED_FOR_ASSERTS_ONLY;
+
+		/* The operation to allow wal writes should be done by now  */
+		Assert(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE);
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/*
+		 * Should have set counter for the final state where wal is no longer
+		 * prohibited.
+		 */
+		Assert(CounterGetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+	}
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * There is no need to aggressively flush XLOG data right away, since
+	 * XLogFlush is not restricted in the wal prohibited state either.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	/*
+	 * Increment the wal prohibit state counter in shared memory once the
+	 * barrier has been processed by every backend, which ensures that all
+	 * backends are in the wal prohibited state.
+	 */
+	if (wal_prohibited)
+	{
+		uint32		wal_prohibit_counter PG_USED_FOR_ASSERTS_ONLY;
+
+		wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+		/* Should have set counter for the final wal prohibited state */
+		Assert(CounterGetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+	}
+
+	if (wal_prohibited)
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	else
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+
+	/* Wake up the backend waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	/*
+	 * Must be called by the checkpointer process or single-user backend.
+	 */
+	if (!(AmCheckpointerProcess() || !IsPostmasterEnvironment))
+		return;
+
+	/*
+	 * Quick exit if a state transition is already being processed, to avoid a
+	 * recursive call to process the wal prohibit state transition in some
+	 * cases, e.g. the end-of-recovery checkpoint.
+	 */
+	if (WALProhibitStateChangeIsInProgress)
+		return;
+
+	WALProhibitStateChangeIsInProgress = true;
+
+	PG_TRY();
+	{
+		CompleteWALProhibitChange();
+	}
+	PG_FINALLY();
+	{
+		WALProhibitStateChangeIsInProgress = false;
+	}
+	PG_END_TRY();
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * GetWALProhibitState()
+ */
+WALProhibitState
+GetWALProhibitState(void)
+{
+	return CounterGetWALProhibitState(GetWALProhibitCounter());
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d8..2d9ad42eca7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2014,23 +2014,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because the pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 98af35b6cf3..5db821a7fcd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -237,9 +238,10 @@ static bool LocalPromoteIsTriggered = false;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress and the server is not in a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -746,6 +748,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedXLogAllowWritesState indicates the state of the last recovery
+	 * checkpoint and required wal write to start the normal server.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -5064,6 +5072,16 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -5340,6 +5358,7 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_PENDING;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -6504,6 +6523,15 @@ SetLatestXTime(TimestampTz xtime)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Fetch latest state of allow WAL writes.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * Fetch timestamp of latest processed commit/abort record.
  */
@@ -6876,13 +6904,29 @@ StartupXLOG(void)
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal prohibited
+		 * state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -8192,8 +8236,29 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 
-	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites();
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which decides whether further WAL inserts are allowed
+	 * or not.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip wal writes and end of recovery checkpoint if the system is in WAL
+	 * prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_SKIPPED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		promoted = XLogAcceptWrites();
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8262,6 +8327,20 @@ XLogAcceptWrites(void)
 	 */
 	volatile XLogCtlData *xlogctl = XLogCtl;
 
+	/*
+	 * If the wal writes required to start the server normally have already
+	 * been performed, we are done.
+	 */
+	if (GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_DONE)
+		return promoted;
+
+	/*
+	 * If the system is in the wal prohibited state, then only the checkpointer
+	 * process should be here, to complete this operation which might have been
+	 * skipped earlier while booting the system in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
+
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
@@ -8311,9 +8390,38 @@ XLogAcceptWrites(void)
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * Spinlock protection isn't needed since only one process will be updating
+	 * this value at a time.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	Assert(AmCheckpointerProcess());
+	Assert(GetXLogWriteAllowedState() != XLOG_ACCEPT_WRITES_DONE);
+
+	/* Prepare to accept WAL writes. */
+	(void) XLogAcceptWrites();
+
+	/*
+	 * We need to update DBState explicitly, just like the startup process
+	 * does, because the end-of-recovery checkpoint sets the DB state to
+	 * shutdown.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -8454,6 +8562,11 @@ PerformRecoveryXLogAction(void)
 		 */
 		CreateEndOfRecoveryRecord();
 	}
+	else if (AmCheckpointerProcess())
+	{
+		/* In checkpointer process, just do it ourselves */
+		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+	}
 	else
 	{
 		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
@@ -8582,9 +8695,9 @@ HotStandbyActiveInReplay(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -8603,9 +8716,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -8632,6 +8756,12 @@ LocalSetXLogInsertAllowed(void)
 	return oldXLogAllowed;
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -8933,9 +9063,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * A restartpoint is created during recovery; the shutdown checkpoint and
+	 * xlog rotation are performed only if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -8948,6 +9082,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -9198,8 +9335,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Prepare to accumulate statistics.
@@ -9648,10 +9790,12 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn)
 {
 	xl_overwrite_contrecord xlrec;
 	XLogRecPtr	recptr;
+	XLogAcceptWritesState state = GetXLogWriteAllowedState();
 
 	/* sanity check */
-	if (!RecoveryInProgress())
-		elog(ERROR, "can only be used at end of recovery");
+	if (state != XLOG_ACCEPT_WRITES_PENDING &&
+		state != XLOG_ACCEPT_WRITES_SKIPPED)
+		elog(ERROR, "can only be used while enabling WAL writes");
 
 	xlrec.overwritten_lsn = aborted_lsn;
 	xlrec.overwrite_time = GetCurrentTimestamp();
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index f6789025a5f..cc6d81cfff2 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -707,6 +707,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_logicalmapdir() FROM PUBLIC;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_replslotdir(text) FROM PUBLIC;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 0c85679420c..833c7f5139b 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a9223e7b108..824e762b684 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -700,10 +700,12 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, WAL writes must be permitted.  Second, we
+		 * need to make sure that there is a worker slot available.  Third, we
+		 * need to make sure that no other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..e869a004aa9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -275,7 +275,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d0..74272dae69e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
@@ -339,6 +340,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		WALProhibitState cur_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -348,6 +350,22 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
+
+		/* Should be in WAL permitted state to perform the checkpoint */
+		cur_state = GetWALProhibitState();
+		if (cur_state != WALPROHIBIT_STATE_READ_WRITE)
+		{
+			/*
+			 * Don't let the checkpointer process do anything until someone
+			 * wakes it up; for example, a backend might later request that the
+			 * system be put back into the read-write state.
+			 */
+			if (cur_state == WALPROHIBIT_STATE_READ_ONLY)
+				(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+								 -1, WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+			continue;
+		}
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -692,6 +710,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1341,3 +1362,19 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows any process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9fa3e0631e6..9b391cb9cc2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -247,6 +248,12 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up the shared memory structure needed to handle concurrent WAL
+	 * prohibit state change requests.
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 6e69398cdda..762d5970f44 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -101,7 +102,6 @@ static ProcSignalSlot *MyProcSignalSlot = NULL;
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
-static bool ProcessBarrierPlaceholder(void);
 
 /*
  * ProcSignalShmemSize
@@ -527,8 +527,8 @@ ProcessProcSignalBarrier(void)
 				type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 				switch (type)
 				{
-					case PROCSIGNAL_BARRIER_PLACEHOLDER:
-						processed = ProcessBarrierPlaceholder();
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
 						break;
 				}
 
@@ -594,24 +594,6 @@ ResetProcSignalBarrierBits(uint32 flags)
 	InterruptPending = true;
 }
 
-static bool
-ProcessBarrierPlaceholder(void)
-{
-	/*
-	 * XXX. This is just a placeholder until the first real user of this
-	 * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to
-	 * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something
-	 * appropriately descriptive. Get rid of this function and instead have
-	 * ProcessBarrierSomethingElse. Most likely, that function should live in
-	 * the file pertaining to that subsystem, rather than here.
-	 *
-	 * The return value should be 'true' if the barrier was successfully
-	 * absorbed and 'false' if not. Note that returning 'false' can lead to
-	 * very frequent retries, so try hard to make that an uncommon case.
-	 */
-	return true;
-}
-
 /*
  * CheckProcSignal - check to see if a particular reason has been
  * signaled, and clear the signal flag.  Should be called after receiving
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c25af7fe090..b595c0db1bd 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56f..b27625f4845 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -241,9 +242,17 @@ SyncPostCheckpoint(void)
 		entry->canceled = true;
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
-		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop.
+		 * As in ProcessSyncRequests, we don't want to stop processing WAL
+		 * prohibit state change requests for a long time when there are many
+		 * deletions to be done.  They need to be checked and processed by the
+		 * checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -302,6 +311,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -360,6 +372,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		/*
+		 * Don't stop processing WAL prohibit state change requests for a long
+		 * time when there are many fsync requests to be processed.  They need
+		 * to be checked and processed by the checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -446,6 +465,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Process WAL prohibit state change requests here too, for the
+				 * same reason as the earlier calls to this function.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d471..7bc3bd369b1 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 4d53f040e81..60e5b985d00 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfda..67ea808c4b9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
@@ -235,6 +236,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -677,6 +679,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2119,6 +2122,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether WAL writes are prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12572,4 +12587,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The returned string should match what _ShowOption() produces for a
+ * boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d946..e4d99a50c06 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 00000000000..d71522cbf3b
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,60 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+CounterGetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+extern WALProhibitState GetWALProhibitState(void);
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 898df2ee034..c8685142e56 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,14 @@ typedef enum WalCompression
 	WAL_COMPRESSION_LZ4
 } WalCompression;
 
+/* State of XLogAcceptWrites() execution */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_PENDING = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped XLogAcceptWrites() */
+	XLOG_ACCEPT_WRITES_DONE			/* done with XLogAcceptWrites() */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -279,6 +287,7 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
@@ -287,8 +296,10 @@ extern RecoveryPauseState GetRecoveryPauseState(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void UpdateControlFile(void);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 extern uint64 GetSystemIdentifier(void);
 extern char *GetMockAuthenticationNonce(void);
 extern bool DataChecksumsEnabled(void);
@@ -299,6 +310,7 @@ extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 749bce0cc6f..19cf88d24ba 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -184,6 +184,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e934361dc32..d4b9c308863 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11671,6 +11671,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4545', descr => 'change server to permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'bool',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index c430b1b2366..bee495f05da 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index eec186be2ee..227adf8eeeb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,12 +49,7 @@ typedef enum
 
 typedef enum
 {
-	/*
-	 * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first
-	 * real user of the ProcSignalBarrier mechanism is added. It's just here
-	 * for now because we can't have an empty enum.
-	 */
-	PROCSIGNAL_BARRIER_PLACEHOLDER = 0
+	PROCSIGNAL_BARRIER_WALPROHIBIT = 0
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 8785a8e12c1..22db80c81c1 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -225,7 +225,8 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index da6ac8ed83e..60622118874 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2834,6 +2834,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0
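
For readers following the state-machine comment added in walprohibit.h above:
below is a minimal, standalone C sketch (not part of the patch) of the counter
encoding it describes, where the shared counter only ever advances by one and
its low two bits name the current state.  The enum and the extraction function
mirror the patch; the tiny main() is illustrative only.

#include <stdio.h>
#include <stdint.h>

typedef enum
{
	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
} WALProhibitState;

/* The shared counter only ever advances by one; its low two bits are the state. */
static WALProhibitState
CounterGetWALProhibitState(uint32_t wal_prohibit_counter)
{
	return (WALProhibitState) (wal_prohibit_counter & 3);
}

int
main(void)
{
	uint32_t	counter;

	/* Starting read-write (counter 0), walk one full read-only/read-write cycle. */
	for (counter = 0; counter <= 4; counter++)
		printf("counter=%u state=%d\n",
			   (unsigned) counter, (int) CounterGetWALProhibitState(counter));
	return 0;
}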

v44-0001-Create-XLogAcceptWrites-function-with-code-from-.patch
From 67b3cad9a4601466c059f0bc3283b182c9511d20 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v44 1/6] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 112 +++++++++++++++++-------------
 1 file changed, 65 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b980c6ac21c..210b982cc56 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -939,6 +939,8 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
+static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+							 TimeLineID newTLI);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
@@ -8134,53 +8136,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
-		missingContrecPtr = InvalidXLogRecPtr;
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 *
-	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
-	 * entered recovery. Even if we ultimately replayed no WAL records, it will
-	 * have been initialized based on where replay was due to start.  We don't
-	 * need a lock to access this, since this can't change any more by the time
-	 * we reach this code.
-	 */
-	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(EndOfLogTLI, EndOfLog, newTLI);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -8235,6 +8192,67 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+				 TimeLineID newTLI)
+{
+	bool		promoted = false;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * XLogCtl->lastReplayedEndRecPtr will be a valid LSN if and only if we
+	 * entered recovery. Even if we ultimately replayed no WAL records, it will
+	 * have been initialized based on where replay was due to start.  We don't
+	 * need a lock to access this, since this can't change any more by the time
+	 * we reach this code.
+	 */
+	if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
-- 
2.18.0
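
As an aside on why this code movement matters for the later patches: below is a
rough, self-contained C sketch (not the patch's actual implementation) of the
pending/skipped/done bookkeeping that a deferred XLogAcceptWrites() call needs,
using the XLogAcceptWritesState enum shown in the xlog.h hunk earlier in the
thread.  Everything else here (the *Sketch functions and the wal_prohibited
flag) is hypothetical.

#include <stdbool.h>
#include <stdio.h>

typedef enum XLogAcceptWritesState
{
	XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
	XLOG_ACCEPT_WRITES_SKIPPED,		/* skipped at startup */
	XLOG_ACCEPT_WRITES_DONE			/* work has been performed */
} XLogAcceptWritesState;

static XLogAcceptWritesState accept_writes_state = XLOG_ACCEPT_WRITES_PENDING;
static bool wal_prohibited = true;	/* pretend the cluster started read-only */

/* Stand-in for the end-of-recovery actions that write WAL. */
static void
XLogAcceptWritesSketch(void)
{
	printf("running end-of-recovery actions that write WAL\n");
	accept_writes_state = XLOG_ACCEPT_WRITES_DONE;
}

/* Called at startup: skip the WAL-writing work if WAL is prohibited. */
static void
StartupSketch(void)
{
	if (wal_prohibited)
		accept_writes_state = XLOG_ACCEPT_WRITES_SKIPPED;
	else
		XLogAcceptWritesSketch();
}

/* Called later, e.g. when the system is put back into read-write mode. */
static void
PerformPendingSketch(void)
{
	if (accept_writes_state == XLOG_ACCEPT_WRITES_SKIPPED)
		XLogAcceptWritesSketch();
}

int
main(void)
{
	StartupSketch();			/* deferred: WAL is prohibited */
	wal_prohibited = false;		/* as if pg_prohibit_wal(false) were run */
	PerformPendingSketch();		/* the deferred work runs now */
	return accept_writes_state == XLOG_ACCEPT_WRITES_DONE ? 0 : 1;
}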

#195Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#194)
6 attachment(s)
Re: [Patch] ALTER SYSTEM READ ONLY

Attached is the rebased version for the latest master head (#891624f0ec).

The 0001 and 0002 patches changed a bit due to the xlog.c refactoring
commit (#70e81861), which needed a bit more thought about copying global
variables into the right shared memory structure.  Also, I made some changes
to the 0003 patch to avoid re-entering XLogAcceptWrites(), as suggested in an
offline discussion.

Regards,
Amul

Attachments:

v45-0005-Documentation.patch
From ec3d2fcd9d4839a430509f514cbbe2c21e52870c Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Tue, 14 Jul 2020 02:30:44 -0400
Subject: [PATCH v45 5/6] Documentation.

---
 doc/src/sgml/func.sgml              | 20 ++++++++++
 doc/src/sgml/high-availability.sgml | 34 ++++++++++++++++
 doc/src/sgml/monitoring.sgml        |  4 ++
 src/backend/access/transam/README   | 60 ++++++++++++++++++++++++++---
 src/backend/storage/page/README     | 12 +++---
 5 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5047e090db..297fe6593c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28092,6 +28092,26 @@ SELECT collation for ('foo' COLLATE "de_DE");
         <literal>false</literal> is returned.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_prohibit_wal</primary>
+        </indexterm>
+        <function>pg_prohibit_wal</function> ( <type>boolean</type> )
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Accepts a boolean argument that alters the WAL read-write state and
+        forces all processes of the <productname>PostgreSQL</productname>
+        server to accept that state change immediately.  When
+        <literal>true</literal> is passed, the system is changed to the WAL
+        prohibited state, in which WAL writes are restricted, if it is not in
+        that state already.  When <literal>false</literal> is passed, the
+        system is changed to the WAL permitted state, in which WAL writes are
+        allowed, if it is not in that state already.  See
+        <xref linkend="wal-prohibited-state"/> for more details.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b0a653373d..c450767698 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2340,4 +2340,38 @@ HINT:  You can then restart the server after making the necessary configuration
 
  </sect1>
 
+ <sect1 id="wal-prohibited-state">
+  <title>WAL Prohibited State</title>
+
+  <indexterm zone="high-availability">
+   <primary>WAL Prohibited State</primary>
+  </indexterm>
+
+   <para>
+    WAL prohibited is a read-only system state.  Any permitted user can call
+    the <function>pg_prohibit_wal</function> function to force the system into
+    a WAL prohibited mode in which write-ahead log inserts are prohibited
+    until the same function is executed again to change the state back to
+    read-write.  As in Hot Standby, connections to the server may still run
+    read-only queries while in the WAL prohibited state.  If the system is in
+    the WAL prohibited state, the GUC <literal>wal_prohibited</literal>
+    reports <literal>on</literal>; otherwise it reports <literal>off</literal>.
+    When the WAL prohibited state is requested, any session that is running a
+    transaction which has already performed, or is expected to perform, WAL
+    writes is terminated.  This is useful for HA setups where the master
+    server needs to stop accepting WAL writes immediately and kick out any
+    transaction expecting WAL writes at the end, for example when the master
+    loses its network or its replication connections fail.
+   </para>
+
+   <para>
+    Shutting down a WAL prohibited system skips the shutdown checkpoint, and
+    at the next start the system goes through crash recovery and stays in the
+    read-only state until it is changed back to read-write.  If a WAL
+    prohibited server finds a <filename>standby.signal</filename> or
+    <filename>recovery.signal</filename> file at startup, the system is
+    implicitly taken out of the WAL prohibited state.
+   </para>
+ </sect1>
 </chapter>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2f44113caa..b79e3353d6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1569,6 +1569,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>SystemWALProhibitStateChange</literal></entry>
+      <entry>Waiting for the WAL prohibit state to change.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..24dca70a6c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -442,8 +442,8 @@ to be modified.
 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 PANIC because the shared buffers will contain unlogged changes, which we
 have to ensure don't get to disk.  Obviously, you should check conditions
-such as whether there's enough free space on the page before you start the
-critical section.)
+such as whether there's WAL write permission and enough free space on the page
+before you start the critical section.)
 
 3. Apply the required changes to the shared buffer(s).
 
@@ -486,6 +486,54 @@ with the incomplete-split flag set, it will finish the interrupted split by
 inserting the key to the parent, before proceeding.
 
 
+WAL prohibited system state
+---------------------------
+
+This is the system state in which it is not currently possible to insert
+write-ahead log records, either because the system is still in recovery or
+because it has been forced read-only by executing the pg_prohibit_wal()
+function.  We have a lower-level defense in XLogBeginInsert() and elsewhere to
+stop us from modifying data during recovery when !XLogInsertAllowed(), but if
+XLogBeginInsert() is called inside a critical section we must not depend on it
+to report an error; otherwise, it will cause a PANIC as mentioned previously.
+
+We never reach the point of trying to write WAL during recovery, but
+pg_prohibit_wal() can be executed by the user at any time to stop WAL writing.
+Any backend that receives the WAL prohibit state transition barrier interrupt
+must stop writing WAL immediately.  To absorb the barrier, a backend kills any
+running transaction that has a valid XID, since a valid XID indicates that the
+transaction has performed, or is planning to perform, WAL writes.  Transactions
+that have not yet acquired a valid XID, and operations such as VACUUM or
+CREATE INDEX CONCURRENTLY that do not necessarily have a valid XID when writing
+WAL, are not stopped during barrier processing; they may instead hit an error
+from XLogBeginInsert() when trying to write WAL in the WAL prohibited state.
+To prevent such an error from being raised inside a critical section, WAL write
+permission has to be checked before START_CRIT_SECTION().
+
+To enforce the practice of checking WAL permission before entering a critical
+section that writes WAL, we have added an assertion flag that indicates whether
+permission has been checked before XLogBeginInsert() is called.  If it has not,
+XLogBeginInsert() fails an assertion.  The WAL permission check is not
+mandatory when XLogBeginInsert() is not inside a critical section, where
+throwing an error is acceptable.  To set the permission check flag, call
+CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+before START_CRIT_SECTION().  The flag is automatically reset on exit from the
+critical section.  The rules for choosing among the permission check routines
+are:
+
+	Places where a WAL write inside a critical section can be expected without
+	a valid XID (e.g. VACUUM) need to be protected by CheckWALPermitted(), so
+	that the error can be reported before entering the critical section.
+
+	Places where INSERT or UPDATE are expected, which never happen without a
+	valid XID, can be checked with AssertWALPermittedHaveXID(), so that
+	non-assert builds do not incur the checking overhead.
+
+	Places that we know cannot be reached in the WAL prohibited state, which
+	may or may not have an XID but should still verify on assert-enabled builds
+	that permission has been checked, should use AssertWALPermitted().
+
+
 Constructing a WAL record
 -------------------------
 
@@ -531,7 +579,8 @@ Details of the API functions:
 
 void XLogBeginInsert(void)
 
-    Must be called before XLogRegisterBuffer and XLogRegisterData.
+    Must be called before XLogRegisterBuffer and XLogRegisterData.  WAL
+    permission must be checked before calling it in a critical section.
 
 void XLogResetInsertion(void)
 
@@ -638,8 +687,9 @@ MarkBufferDirtyHint() to mark the block dirty.
 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 that includes the hint. We do this to avoid a partial page write, when we
-write the dirtied page. WAL is not written during recovery, so we simply skip
-dirtying blocks because of hints when in recovery.
+write the dirtied page. WAL is not written while the system is read only (i.e.
+during recovery or in the WAL prohibited state), so we simply skip dirtying
+blocks because of hints in those cases.
 
 If you do decide to optimise away a WAL record, then any calls to
 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59a..15f0bb4b7b 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -56,9 +56,9 @@ WAL is a fatal error and prevents further recovery, whereas a checksum failure
 on a normal data block is a hard error but not a critical one for the server,
 even if it is a very bad thing for the user.
 
-New WAL records cannot be written during recovery, so hint bits set during
-recovery must not dirty the page if the buffer is not already dirty, when
-checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+New WAL records cannot be written during recovery or while in the WAL prohibit
+state, so hint bits set during recovery must not dirty the page if the buffer is
+not already dirty, when checksums are enabled.  Systems in Hot-Standby mode may
+benefit from hint bits being set, but with checksums enabled, a page cannot be
+dirtied after setting a hint bit (due to the torn page risk). So, it must wait
+for full-page images containing the hint bit updates to arrive from the primary.
-- 
2.18.0
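
As a quick usage illustration of the function documented above, here is a small
libpq sketch (not part of the patch).  The connection string is an assumption,
the server must have these patches applied, and the program should be compiled
with -lpq.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* Make the system read-only; pass false instead to allow WAL writes again. */
	res = PQexec(conn, "SELECT pg_prohibit_wal(true)");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "pg_prohibit_wal failed: %s", PQerrorMessage(conn));
	PQclear(res);

	/* The wal_prohibited GUC reports the current state. */
	res = PQexec(conn, "SHOW wal_prohibited");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("wal_prohibited = %s\n", PQgetvalue(res, 0, 0));
	PQclear(res);

	PQfinish(conn);
	return 0;
}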

v45-0006-Test-Few-tap-tests-for-wal-prohibited-system.patch
From 64ff08a76e6b9ed61f8b5cd51cd86d619bdd4a43 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 27 Aug 2021 08:18:40 -0400
Subject: [PATCH v45 6/6] Test: Few tap tests for wal prohibited system

Does the following testing:

1. Basic verification, such as inserting into normal and unlogged tables
   on a WAL prohibited system.
2. Check the permission needed by a non-superuser to alter the WAL
   prohibited system state.
3. Verify that open write transactions are disconnected when the system
   state has been changed to WAL prohibited.
4. Verify that the WAL write and checkpoint LSNs do not change across a
   restart of a WAL prohibited system, and that the WAL prohibited state
   is retained.
5. Verify that the shutdown checkpoint and the recovery-end checkpoint
   are skipped when restarting a WAL prohibited system, and that an
   implicit checkpoint is performed when the system state changes to
   WAL permitted.
6. A standby server cannot be WAL prohibited; standby.signal and/or
   recovery.signal take the system out of the WAL prohibited state.
7. Terminate a session whose transaction has performed a write but not
   yet committed while the state is being changed to WAL prohibited.
8. Change the 026_overwrite_contrecord.pl test to run against a WAL
   prohibited system.  (XXX: perhaps this file should be copied for WAL
   prohibited testing; I think that's not needed.)
---
 .../recovery/t/026_overwrite_contrecord.pl    |  11 +-
 src/test/recovery/t/032_pg_prohibit_wal.pl    | 216 ++++++++++++++++++
 2 files changed, 223 insertions(+), 4 deletions(-)
 create mode 100644 src/test/recovery/t/032_pg_prohibit_wal.pl

diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl b/src/test/recovery/t/026_overwrite_contrecord.pl
index 78feccd9aa..34082568b4 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -63,10 +63,11 @@ my $endfile = $node->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_insert_lsn())');
 ok($initfile ne $endfile, "$initfile differs from $endfile");
 
-# Now stop abruptly, to avoid a stop checkpoint.  We can remove the tail file
-# afterwards, and on startup the large message should be overwritten with new
-# contents
-$node->stop('immediate');
+# Change the system to WAL prohibited, which skips the shutdown checkpoint.  We can
+# remove the tail file afterwards, and on startup the large message should be
+# overwritten with new contents
+$node->safe_psql('postgres', qq{SELECT pg_prohibit_wal(true)});
+$node->stop;
 
 unlink $node->basedir . "/pgdata/pg_wal/$endfile"
   or die "could not unlink " . $node->basedir . "/pgdata/pg_wal/$endfile: $!";
@@ -79,6 +80,8 @@ $node_standby->init_from_backup($node, 'backup', has_streaming => 1);
 $node_standby->start;
 $node->start;
 
+# Change system to wal permitted now.
+$node->safe_psql('postgres', qq{SELECT pg_prohibit_wal(false)});
 $node->safe_psql('postgres',
 	qq{create table foo (a text); insert into foo values ('hello')});
 $node->safe_psql('postgres',
diff --git a/src/test/recovery/t/032_pg_prohibit_wal.pl b/src/test/recovery/t/032_pg_prohibit_wal.pl
new file mode 100644
index 0000000000..c3491441c7
--- /dev/null
+++ b/src/test/recovery/t/032_pg_prohibit_wal.pl
@@ -0,0 +1,216 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test wal prohibited state.
+use strict;
+use warnings;
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 22;
+
+# Query to read wal_prohibited GUC
+my $show_wal_prohibited_query = "SELECT current_setting('wal_prohibited')";
+
+# Initialize database node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Create a few tables and insert some data
+$node_primary->safe_psql('postgres',  <<EOSQL);
+CREATE TABLE tab AS SELECT 1 AS i;
+CREATE UNLOGGED TABLE unlogtab AS SELECT 1 AS i;
+EOSQL
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is now wal prohibited');
+
+#
+# In wal prohibited state, further table insert will fail.
+#
+# Note that even though inter into unlogged and temporary table doesn't generate
+# wal but the transaction does that insert operation will acquire transaction id
+# which is not allowed on wal prohibited system. Also, that transaction's abort
+# or commit state will be wal logged at the end which is prohibited as well.
+#
+my ($stdout, $stderr, $timed_out);
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(2)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, table insert fails');
+$node_primary->psql('postgres', 'INSERT INTO unlogtab VALUES(2)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'server is wal prohibited, unlogged table insert fails');
+
+# Get current wal write and latest checkpoint lsn
+my $write_lsn = $node_primary->lsn('write');
+my $checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+
+# Restart the server; the shutdown and startup checkpoints will be skipped.
+$node_primary->restart;
+
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is wal prohibited after restart too');
+is($node_primary->lsn('write'), $write_lsn,
+	"no wal writes on server, last wal write lsn : $write_lsn");
+is(get_latest_checkpoint_location($node_primary), $checkpoint_lsn,
+	"no new checkpoint, last checkpoint lsn : $checkpoint_lsn");
+
+# Change server to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'server is changed to wal permitted');
+
+my $new_checkpoint_lsn = get_latest_checkpoint_location($node_primary);
+ok($new_checkpoint_lsn ne $checkpoint_lsn,
+	"new checkpoint performed, new checkpoint lsn : $new_checkpoint_lsn");
+
+my $new_write_lsn = $node_primary->lsn('write');
+ok($new_write_lsn ne $write_lsn,
+	"new wal writes on server, new latest wal write lsn : $new_write_lsn");
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(2)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '2',
+	'table insert passed');
+
+# Only superusers and users who have been granted permission are able to call
+# pg_prohibit_wal to change the WAL prohibited state.
+$node_primary->safe_psql('postgres', 'CREATE USER non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+like($stderr, qr/permission denied for function pg_prohibit_wal/,
+	'permission denied to non-superuser to alter wal prohibited state');
+$node_primary->safe_psql('postgres', 'GRANT EXECUTE ON FUNCTION pg_prohibit_wal TO non_superuser');
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+	stdout => \$stdout, stderr => \$stderr, extra_params => [ '-U', 'non_superuser' ]);
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'granted permission to non-superuser, able to alter wal prohibited state');
+
+# back to normal state
+$node_primary->psql('postgres', 'SELECT pg_prohibit_wal(false)');
+
+my $psql_timeout = IPC::Run::timer(60);
+my ($rw_session_stdin, $rw_session_stdout, $rw_session_stderr) = ('', '', '');
+my $rw_session = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$rw_session_stdin,
+	'>',
+	\$rw_session_stdout,
+	'2>',
+	\$rw_session_stderr,
+	$psql_timeout);
+
+# Start a write transaction in the interactive session
+$rw_session_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(4);
+SELECT $$value-4-inserted-into-tab$$;
+];
+ok(pump_until($rw_session, $psql_timeout, \$rw_session_stdout,
+		qr/value-4-inserted-into-tab/m),
+	"started write transaction in a session");
+$rw_session_stdout = '';
+$rw_session_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$rw_session_stdin .= q[
+COMMIT;
+];
+ok(pump_until($rw_session, $psql_timeout, \$rw_session_stderr,
+		qr/FATAL:  WAL is now prohibited|server closed the connection unexpectedly|connection to server was lost|could not send data to server/m),
+	"session with open write transaction is terminated");
+
+# Now stop the primary server in the WAL prohibited state, take a filesystem
+# level backup, and set up a new server from it.
+$node_primary->stop;
+my $backup_name = 'my_backup';
+$node_primary->backup_fs_cold($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name);
+$node_standby->start;
+
+# The primary server was stopped in the WAL prohibited state, so the filesystem
+# level copy is also in the WAL prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+	'new server created using backup of a stopped primary is also wal prohibited');
+
+# Start Primary
+$node_primary->start;
+
+# Set up the new server as a standby of the primary.
+# enable_streaming will create a standby.signal file, which will take the
+# system out of the WAL prohibited state.
+$node_standby->enable_streaming($node_primary);
+$node_standby->restart;
+
+# Check that the new server has been taken out of the WAL prohibited state.
+is($node_standby->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'new server as standby is no longer wal prohibited');
+
+# A server in recovery cannot be put into the WAL prohibited state.
+$node_standby->psql('postgres', 'SELECT pg_prohibit_wal(true)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute pg_prohibit_wal\(\) during recovery/,
+	'standby server state cannot be changed to wal prohibited');
+
+# The primary is still in the WAL prohibited state, so a further insert will fail.
+$node_primary->psql('postgres', 'INSERT INTO tab VALUES(3)',
+          stdout => \$stdout, stderr => \$stderr);
+like($stderr, qr/cannot execute INSERT in a read-only transaction/,
+	'primary server is wal prohibited, table insert fails');
+
+# Change primary to WAL permitted
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(false)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query),
+	'off', 'primary server is changed to wal permitted');
+
+# Insert data
+$node_primary->safe_psql('postgres', 'INSERT INTO tab VALUES(3)');
+is($node_primary->safe_psql('postgres', 'SELECT count(i) FROM tab'), '3',
+	'insert passed on primary');
+
+# Wait for standbys to catch up
+$node_primary->wait_for_catchup($node_standby, 'write');
+is($node_standby->safe_psql('postgres', 'SELECT count(i) FROM tab'), '3',
+	'new insert replicated on standby as well');
+
+
+#
+# Get latest checkpoint lsn from control file
+#
+sub get_latest_checkpoint_location
+{
+	my ($node) = @_;
+	my $data_dir = $node->data_dir;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $data_dir ]);
+	my @control_data = split("\n", $stdout);
+
+	my $latest_checkpoint_lsn = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint_lsn = $1;
+			last;
+		}
+	}
+	die "No latest checkpoint location in control file found\n"
+	unless defined($latest_checkpoint_lsn);
+
+	return $latest_checkpoint_lsn;
+}
-- 
2.18.0
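
Side note: the get_latest_checkpoint_location() helper in the TAP test above
boils down to scanning pg_controldata output for a single line.  For anyone who
wants the same check outside Perl, here is an equivalent standalone C sketch
(not part of the patch); the data directory path is a placeholder.

#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *key = "Latest checkpoint location:";
	FILE	   *fp = popen("pg_controldata /path/to/pgdata", "r");
	char		line[1024];

	if (fp == NULL)
	{
		perror("popen");
		return 1;
	}

	while (fgets(line, sizeof(line), fp) != NULL)
	{
		if (strncmp(line, key, strlen(key)) == 0)
		{
			const char *p = line + strlen(key);

			while (*p == ' ')
				p++;
			printf("%s", p);	/* the LSN, with its trailing newline */
		}
	}

	pclose(fp);
	return 0;
}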

v45-0002-Remove-dependencies-on-startup-process-specifica.patch
From 37b9c2749d77e94cb41b4a7ca7a54ceca65a7d57 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Wed, 6 Apr 2022 01:00:08 -0400
Subject: [PATCH v45 2/6] Remove dependencies on startup-process specific
 variables.

To make XLogAcceptWrites() callable outside StartupXLOG(), we need to
remove its dependency on a few global and local variables that are
specific to the startup process.

The global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered.
LocalPromoteIsTriggered can already be accessed from any other process
using the existing PromoteIsTriggered(); abortedRecPtr,
missingContrecPtr and ArchiveRecoveryRequested are made accessible by
copying them into shared memory.

XLogAcceptWrites() accepted two arguments, EndOfLogTLI and EndOfLog,
which are local to StartupXLOG().  Both of these are also exported into
shared memory, since none of the existing shared memory variables
matches these values exactly.

Also, make sure to use a volatile pointer when accessing XLogCtl so that
the latest shared variable values are read.
---
 src/backend/access/transam/xlog.c         | 124 ++++++++++++++++------
 src/backend/access/transam/xlogrecovery.c |  36 ++++++-
 src/include/access/xlogrecovery.h         |   1 +
 3 files changed, 128 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e7e99a850..19a499e4e6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -552,6 +552,24 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr so that another process can
+	 * write the OVERWRITE_CONTRECORD message if WAL writes are not permitted
+	 * in the process that reads the broken record.  SharedMissingContrecPtr
+	 * exports missingContrecPtr for the same reason.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+	XLogRecPtr	SharedMissingContrecPtr;
+
+	/*
+	 * Determines the endpoint of the portion of WAL that we consider valid at
+	 * server startup.  It is invalid during recovery and does not change once
+	 * set.
+	 */
+	XLogRecPtr	endOfLog;
+	TimeLineID	endOfLogTLI;
+
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -634,9 +652,7 @@ static bool holdingAllLocks = false;
 static MemoryContext walDebugCxt = NULL;
 #endif
 
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-										XLogRecPtr EndOfLog,
-										TimeLineID newTLI);
+static void CleanupAfterArchiveRecovery(void);
 static void CheckRequiredParameterValues(void);
 static void XLogReportParameters(void);
 static int	LocalSetXLogInsertAllowed(void);
@@ -665,10 +681,7 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static bool XLogAcceptWrites(bool performedWalRecovery, TimeLineID newTLI,
-							 TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							 XLogRecPtr abortedRecPtr,
-							 XLogRecPtr missingContrecPtr);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
@@ -4746,9 +4759,17 @@ XLogInitNewTimeline(TimeLineID endTLI, XLogRecPtr endOfLog, TimeLineID newTLI)
  * Perform cleanup actions at the conclusion of archive recovery.
  */
 static void
-CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-							TimeLineID newTLI)
+CleanupAfterArchiveRecovery(void)
 {
+	/*
+	 * Use a volatile pointer to make sure we get a fresh read of the
+	 * shared variables.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	XLogRecPtr	EndOfLog = xlogctl->endOfLog;
+	TimeLineID	EndOfLogTLI = xlogctl->endOfLogTLI;
+
 	/*
 	 * Execute the recovery_end_command, if any.
 	 */
@@ -4766,7 +4787,7 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
 	 * pre-allocated files containing garbage. In any case, they are not part
 	 * of the new timeline's history so we don't need them.
 	 */
-	RemoveNonParentXlogFiles(EndOfLog, newTLI);
+	RemoveNonParentXlogFiles(EndOfLog, xlogctl->InsertTimeLineID);
 
 	/*
 	 * If the switch happened in the middle of a segment, what to do with the
@@ -4889,7 +4910,6 @@ StartupXLOG(void)
 	XLogRecPtr	EndOfLog;
 	TimeLineID	EndOfLogTLI;
 	TimeLineID	newTLI;
-	bool		performedWalRecovery;
 	EndOfWalRecoveryInfo *endOfRecoveryInfo;
 	XLogRecPtr	abortedRecPtr;
 	XLogRecPtr	missingContrecPtr;
@@ -5293,10 +5313,24 @@ StartupXLOG(void)
 		 * We're all set for replaying the WAL now. Do it.
 		 */
 		PerformWalRecovery();
-		performedWalRecovery = true;
+
+		/*
+		 * The redo apply position is checked to find out whether WAL recovery
+		 * was performed, because it is a valid LSN if and only if we entered
+		 * recovery.  Even if we ultimately replayed no WAL records, it will
+		 * have been initialized based on where replay was due to start.
+		 */
+		Assert(!XLogRecPtrIsInvalid(GetXLogReplayRecPtr(NULL)));
 	}
 	else
-		performedWalRecovery = false;
+	{
+		/*
+		 * Redo apply position will be an invalid LSN if we haven't entered
+		 * recovery.
+		 */
+		Assert(XLogRecPtrIsInvalid(GetXLogReplayRecPtr(NULL)));
+	}
 
 	/*
 	 * Finish WAL recovery.
@@ -5441,6 +5475,15 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+
+		/*
+		 * Remember the broken record pointer in shared memory.  This process
+		 * might be unable to write an OVERWRITE_CONTRECORD message because of
+		 * the WAL write restriction.  Storing it in shared memory lets another
+		 * process write the record as soon as WAL writing is enabled.
+		 */
+		XLogCtl->SharedAbortedRecPtr = abortedRecPtr;
+		XLogCtl->SharedMissingContrecPtr = missingContrecPtr;
 	}
 
 	/*
@@ -5506,6 +5549,13 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
+	/*
+	 * Store EndOfLog and EndOfLogTLI into shared memory to share with other
+	 * processes.
+	 */
+	XLogCtl->endOfLog = EndOfLog;
+	XLogCtl->endOfLogTLI = EndOfLogTLI;
+
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
@@ -5531,9 +5581,15 @@ StartupXLOG(void)
 	/* Shut down xlogreader */
 	ShutdownWalRecovery();
 
+	/*
+	 * Update full_page_writes in shared memory.  Later, once WAL writes are
+	 * permitted, an XLOG_FPW_CHANGE record is written before the resource
+	 * manager writes cleanup WAL records or a checkpoint record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+
 	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites(performedWalRecovery, newTLI, EndOfLogTLI,
-								EndOfLog, abortedRecPtr, missingContrecPtr);
+	promoted = XLogAcceptWrites();
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -5592,35 +5648,39 @@ StartupXLOG(void)
  * Prepare to accept WAL writes.
  */
 static bool
-XLogAcceptWrites(bool performedWalRecovery, TimeLineID newTLI,
-				 TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
-				 XLogRecPtr abortedRecPtr, XLogRecPtr missingContrecPtr)
+XLogAcceptWrites(void)
 {
 	bool		promoted = false;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/*
+	 * Use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(xlogctl->SharedAbortedRecPtr))
 	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
+		Assert(!XLogRecPtrIsInvalid(xlogctl->SharedMissingContrecPtr));
+		CreateOverwriteContrecordRecord(xlogctl->SharedAbortedRecPtr,
+										xlogctl->SharedMissingContrecPtr,
+										xlogctl->InsertTimeLineID);
 	}
 
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	/* Write an XLOG_FPW_CHANGE record */
 	UpdateFullPageWrites();
 
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 *
+	 * The redo apply position will be a valid LSN if and only if we entered
+	 * recovery; see the comment on the corresponding assertion in
+	 * StartupXLOG() for more detail.
 	 */
-	if (performedWalRecovery)
+	if (!XLogRecPtrIsInvalid(GetXLogReplayRecPtr(NULL)))
 		promoted = PerformRecoveryXLogAction();
 
 	/*
@@ -5630,8 +5690,8 @@ XLogAcceptWrites(bool performedWalRecovery, TimeLineID newTLI,
 	XLogReportParameters();
 
 	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+	if (ArchiveRecoveryIsRequested())
+		CleanupAfterArchiveRecovery();
 
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
@@ -5739,7 +5799,7 @@ PerformRecoveryXLogAction(void)
 	 * of a full checkpoint. A checkpoint is requested later, after we're
 	 * fully out of recovery mode and already accepting queries.
 	 */
-	if (ArchiveRecoveryRequested && IsUnderPostmaster &&
+	if (ArchiveRecoveryIsRequested() && IsUnderPostmaster &&
 		PromoteIsTriggered())
 	{
 		promoted = true;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 2e555f8573..99cad8bcf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -130,7 +130,10 @@ static TimeLineID curFileTLI;
  * currently performing crash recovery using only XLOG files in pg_wal, but
  * will switch to using offline XLOG archives as soon as we reach the end of
  * WAL in pg_wal.
-*/
+ *
+ * NB: ArchiveRecoveryRequested is also copied into shared memory so that other
+ * backends can read it through ArchiveRecoveryIsRequested().
+ */
 bool		ArchiveRecoveryRequested = false;
 bool		InArchiveRecovery = false;
 
@@ -144,6 +147,7 @@ static bool StandbyModeRequested = false;
 bool		StandbyMode = false;
 
 /* was a signal file present at startup? */
+
 static bool standby_signal_file_found = false;
 static bool recovery_signal_file_found = false;
 
@@ -312,6 +316,13 @@ typedef struct XLogRecoveryCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag, which is otherwise valid only in the
+	 * startup process, so that other backends can read it.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * recoveryWakeupLatch is used to wake up the startup process to continue
 	 * WAL replay, if it is waiting for WAL to arrive or failover trigger file
@@ -1042,6 +1053,11 @@ readRecoverySignalFile(void)
 		ereport(FATAL,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("standby mode is not supported by single-user servers")));
+
+	/*
+	 * Remember archive recovery request in shared memory state.
+	 */
+	XLogRecoveryCtl->SharedArchiveRecoveryRequested = ArchiveRecoveryRequested;
 }
 
 static void
@@ -4235,6 +4251,24 @@ StartupRequestWalReceiverRestart(void)
 	}
 }
 
+/*
+ * Reads ArchiveRecoveryRequested value from the shared memory.
+ *
+ * ArchiveRecoveryRequested is only valid in the process that reads the signal
+ * files; any other backend that needs this value should read it through this
+ * function.
+ */
+bool
+ArchiveRecoveryIsRequested(void)
+{
+	/*
+	 * Use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogRecoveryCtlData *xlogrecoveryctl = XLogRecoveryCtl;
+
+	return xlogrecoveryctl->SharedArchiveRecoveryRequested;
+}
 
 /*
  * Has a standby promotion already been triggered?
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 75a0f5fe5e..14a838ed3d 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -143,6 +143,7 @@ extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
 extern XLogRecPtr GetCurrentReplayRecPtr(TimeLineID *replayEndTLI);
 
+extern bool ArchiveRecoveryIsRequested(void);
 extern bool PromoteIsTriggered(void);
 extern bool CheckPromoteSignal(void);
 extern void WakeupRecovery(void);
-- 
2.18.0

v45-0003-Implement-wal-prohibit-state-using-global-barrie.patch (application/octet-stream)
From 5160bdcd32436eecc66b5ae524c532e02c1c3c53 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Fri, 19 Jun 2020 06:29:36 -0400
Subject: [PATCH v45 3/6] Implement wal prohibit state using global barrier.

Implementation:

 1. A user changes the server state to WAL-prohibited by calling the
    pg_prohibit_wal(true) SQL function (a usage sketch follows this list).
    The current state is marked as in-progress in shared memory and the
    checkpointer process is signaled.  Noticing the pending state
    transition, the checkpointer emits the barrier request and, once the
    transition has completed, acknowledges back to the backend that
    requested the state change.  The final state is also recorded in the
    control file to make it persistent across system restarts.

 2. When a backend absorbs the WAL-prohibited barrier, if it is already in a
    transaction and that transaction has been assigned an XID, the backend
    is killed by throwing FATAL.  (XXX: this needs more discussion.)

 3. Otherwise, if the backend is running a transaction without a valid XID,
    nothing special is needed right now; it simply calls
    ResetLocalXLogInsertAllowed() so that any future WAL insert checks
    XLogInsertAllowed() first, which reflects the WAL-prohibited state
    appropriately.

 4. A new transaction (in an existing or in a new backend) starts as a
    read-only transaction.

 5. The autovacuum launcher as well as the checkpointer will not do anything
    while the server is WAL-prohibited until someone wakes them up, e.g. a
    backend that later requests putting the system back into a state where
    WAL is no longer prohibited.

 6. At shutdown in WAL-prohibited mode, the shutdown checkpoint and xlog
    rotation are skipped.  Starting up again will perform crash recovery,
    but the end-of-recovery checkpoint and the WAL writes needed to start
    the server normally are skipped; they are performed later, once the
    system is changed so that WAL is no longer prohibited.

 7. Altering the WAL-prohibited mode is not allowed on a standby server.

 8. The presence of a standby.signal and/or recovery.signal file implicitly
    and permanently takes the server out of the WAL-prohibited state.

 9. Add a wal_prohibited GUC to show the system state -- it will be "on"
    when the system is WAL-prohibited.
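
A minimal usage sketch, based only on the SQL function and the GUC added by
this patch (the sequence and timing are illustrative, not part of the patch):

    -- put the system into the WAL-prohibited (read-only) state
    SELECT pg_prohibit_wal(true);

    -- shows "on" while WAL is prohibited
    SHOW wal_prohibited;

    -- allow WAL writes again
    SELECT pg_prohibit_wal(false);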
---
 src/backend/access/transam/Makefile       |   1 +
 src/backend/access/transam/walprohibit.c  | 379 ++++++++++++++++++++++
 src/backend/access/transam/xact.c         |  36 +-
 src/backend/access/transam/xlog.c         | 162 ++++++++-
 src/backend/access/transam/xlogrecovery.c |  22 +-
 src/backend/catalog/system_functions.sql  |   2 +
 src/backend/commands/variable.c           |   7 +
 src/backend/postmaster/autovacuum.c       |   8 +-
 src/backend/postmaster/bgwriter.c         |   2 +-
 src/backend/postmaster/checkpointer.c     |  37 +++
 src/backend/storage/ipc/ipci.c            |   7 +
 src/backend/storage/ipc/procsignal.c      |   4 +
 src/backend/storage/lmgr/lock.c           |   6 +-
 src/backend/storage/sync/sync.c           |  31 +-
 src/backend/tcop/utility.c                |   1 +
 src/backend/utils/activity/wait_event.c   |   3 +
 src/backend/utils/misc/guc.c              |  27 ++
 src/bin/pg_controldata/pg_controldata.c   |   2 +
 src/include/access/walprohibit.h          |  60 ++++
 src/include/access/xlog.h                 |  13 +
 src/include/catalog/pg_control.h          |   3 +
 src/include/catalog/pg_proc.dat           |   4 +
 src/include/postmaster/bgwriter.h         |   2 +
 src/include/storage/procsignal.h          |   3 +-
 src/include/utils/wait_event.h            |   3 +-
 src/test/regress/expected/guc.out         |   3 +-
 src/tools/pgindent/typedefs.list          |   1 +
 27 files changed, 781 insertions(+), 48 deletions(-)
 create mode 100644 src/backend/access/transam/walprohibit.c
 create mode 100644 src/include/access/walprohibit.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 3e5444a6f7..8f9b267cf0 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -26,6 +26,7 @@ OBJS = \
 	twophase.o \
 	twophase_rmgr.o \
 	varsup.o \
+	walprohibit.o \
 	xact.o \
 	xlog.o \
 	xlogarchive.o \
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
new file mode 100644
index 0000000000..d968c71494
--- /dev/null
+++ b/src/backend/access/transam/walprohibit.c
@@ -0,0 +1,379 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprohibit.c
+ * 		PostgreSQL write-ahead log prohibit states
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/walprohibit.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/walprohibit.h"
+#include "fmgr.h"
+#include "pgstat.h"
+#include "port/atomics.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "storage/condition_variable.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/latch.h"
+#include "utils/acl.h"
+#include "utils/fmgroids.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Shared-memory WAL prohibit state structure
+ */
+typedef struct WALProhibitData
+{
+	/*
+	 * Current WAL prohibit state counter; the last two bits of this counter
+	 * indicate the current WAL prohibit state.
+	 */
+	pg_atomic_uint32 wal_prohibit_counter;
+
+	/* Signaled when requested WAL prohibit state changes */
+	ConditionVariable wal_prohibit_cv;
+} WALProhibitData;
+
+static WALProhibitData *WALProhibit = NULL;
+
+static inline uint32 GetWALProhibitCounter(void);
+static inline uint32 AdvanceWALProhibitStateCounter(void);
+
+/*
+ * ProcessBarrierWALProhibit()
+ *
+ *	Force a backend to take appropriate action when the system-wide WAL
+ *	prohibit state is changing.
+ */
+bool
+ProcessBarrierWALProhibit(void)
+{
+	/*
+	 * Kill off any transactions that have an XID *before* allowing the system
+	 * to go into the WAL prohibit state.
+	 */
+	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
+	{
+		/*
+		 * We should be here only while transitioning towards the WAL
+		 * prohibit state.
+		 */
+		Assert(GetWALProhibitState() == WALPROHIBIT_STATE_READ_ONLY);
+
+		/*
+		 * XXX: Kill off the whole session by throwing FATAL instead of
+		 * killing only the transaction by throwing ERROR, for the following
+		 * reasons:
+		 *
+		 * 1. The wire protocol presents challenges that prevent us from
+		 * simply killing an idle transaction.
+		 *
+		 * 2. If we are inside a subtransaction, an ERROR would kill only the
+		 * current subtransaction.  In the case of invalidations, that might
+		 * be good enough, but for XID assignment it's not, because assigning
+		 * an XID to a subtransaction also causes higher sub-transaction
+		 * levels and the parent transaction to get XIDs.
+		 */
+		ereport(FATAL,
+				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited"),
+				 errhint("Sessions with open write transactions must be terminated.")));
+	}
+
+	/* Return to "check" state */
+	ResetLocalXLogInsertAllowed();
+
+	return true;
+}
+
+/*
+ * pg_prohibit_wal()
+ *
+ *	SQL callable function to toggle WAL prohibit state.
+ *
+ *	NB: The function always returns true; this leaves room for future code
+ *	changes that might need to return false for some reason.
+ */
+Datum
+pg_prohibit_wal(PG_FUNCTION_ARGS)
+{
+	bool		walprohibit = PG_GETARG_BOOL(0);
+	uint32		wal_prohibit_counter;
+	uint32		target_counter_value;
+
+	/* WAL prohibit state changes not allowed during recovery. */
+	PreventCommandDuringRecovery("pg_prohibit_wal()");
+
+	/* For more detail on state transition, see comment for WALProhibitState */
+	switch (GetWALProhibitState())
+	{
+		case WALPROHIBIT_STATE_READ_WRITE:
+			if (!walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_WRITE:
+			if (walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL permission is already in progress"),
+						 errhint("Try again later.")));
+			break;
+
+		case WALPROHIBIT_STATE_READ_ONLY:
+			if (walprohibit)
+				PG_RETURN_BOOL(true);	/* already in the requested state */
+			break;
+
+		case WALPROHIBIT_STATE_GOING_READ_ONLY:
+			if (!walprohibit)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("system state transition to WAL prohibition is already in progress"),
+						 errhint("Try again later.")));
+			break;
+	}
+
+	wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+	target_counter_value = wal_prohibit_counter + 1;
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		/* Target state must be the requested one. */
+		WALProhibitState target_state =
+			CounterGetWALProhibitState(target_counter_value);
+
+		Assert((walprohibit && target_state == WALPROHIBIT_STATE_READ_ONLY) ||
+			   (!walprohibit && target_state == WALPROHIBIT_STATE_READ_WRITE));
+	}
+#endif
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		ProcessWALProhibitStateChangeRequest();
+		PG_RETURN_BOOL(true);
+	}
+
+	/*
+	 * This is not the final state, since we have yet to convey the WAL
+	 * prohibit state to all backends.  The checkpointer will do that and
+	 * update the shared memory WAL prohibit state counter and control file.
+	 */
+	if (!SendSignalToCheckpointer(SIGUSR1))
+	{
+		ereport(WARNING,
+				(errmsg("could not change system state now"),
+				 errdetail("Checkpointer might not be running."),
+				 errhint("The relaunched checkpointer process will automatically complete the system state change.")));
+		PG_RETURN_BOOL(true);		/* no wait */
+	}
+
+	/* Wait for the state counter in shared memory to change. */
+	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
+
+	/*
+	 * We'll be done once the WAL prohibit state counter reaches the target
+	 * value.
+	 */
+	while (GetWALProhibitCounter() < target_counter_value)
+		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
+							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+	ConditionVariableCancelSleep();
+
+	PG_RETURN_BOOL(true);
+}
+
+/*
+ * IsWALProhibited()
+ *
+ *	Is the system still in WAL prohibited state?
+ */
+bool
+IsWALProhibited(void)
+{
+	/* Any state other than read-write is considered a WAL prohibited state */
+	return (GetWALProhibitState() != WALPROHIBIT_STATE_READ_WRITE);
+}
+
+/*
+ * AdvanceWALProhibitStateCounter()
+ *
+ *	Increment wal prohibit counter by 1.
+ */
+static inline uint32
+AdvanceWALProhibitStateCounter(void)
+{
+	return pg_atomic_add_fetch_u32(&WALProhibit->wal_prohibit_counter, 1);
+}
+
+/*
+ * ProcessWALProhibitStateChangeRequest()
+ */
+void
+ProcessWALProhibitStateChangeRequest(void)
+{
+	bool		isReadOnlyRequest;
+	uint64		barrier_gen;
+	uint32		wal_prohibit_counter PG_USED_FOR_ASSERTS_ONLY;
+	WALProhibitState cur_state;
+
+	/*
+	 * Only the checkpointer process or a single-user backend should complete
+	 * the WAL prohibit state transition.
+	 */
+	if (!(AmCheckpointerProcess() || !IsPostmasterEnvironment))
+		return;
+
+	/* Fetch shared wal prohibit state */
+	cur_state = GetWALProhibitState();
+
+	/* Quick exit if not in transition state */
+	if (cur_state != WALPROHIBIT_STATE_GOING_READ_ONLY &&
+		cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE)
+		return;
+
+	isReadOnlyRequest = (cur_state == WALPROHIBIT_STATE_GOING_READ_ONLY);
+
+	/*
+	 * Update control file to make the state persistent.
+	 *
+	 * Once a WAL prohibit state transition has been initiated, it must be
+	 * completed.  If the server crashes before it completes, the control file
+	 * information is used to set the final WAL prohibit state on restart.
+	 */
+	SetControlFileWALProhibitFlag(isReadOnlyRequest);
+
+	/*
+	 * If the server was started in the WAL prohibited state, the WAL writes
+	 * that the startup process would normally perform to start the server
+	 * were skipped at that time; if so, do them right away now.
+	 */
+	if (!isReadOnlyRequest)
+	{
+		XLogAcceptWritesState state = GetXLogWriteAllowedState();
+
+		/* Quick exit if already in XLogAcceptWrites() */
+		if (state == XLOG_ACCEPT_WRITES_STARTED)
+			return;
+
+		/* Complete pending XLogAcceptWrites() */
+		if (state != XLOG_ACCEPT_WRITES_DONE)
+			PerformPendingXLogAcceptWrites();
+	}
+
+	/*
+	 * Increment the shared memory state counter first and then emit the
+	 * global barrier.  Any backend connecting afterwards will follow the
+	 * respective WAL prohibited state.
+	 */
+	wal_prohibit_counter = AdvanceWALProhibitStateCounter();
+
+	/*
+	 * WAL prohibit state change is initiated.  We need to complete the state
+	 * transition by setting requested WAL prohibit state in all backends.
+	 */
+	elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");
+
+	/* Emit global barrier */
+	barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
+	WaitForProcSignalBarrier(barrier_gen);
+
+	/*
+	 * Don't need to be too aggressive to flush XLOG data right away since
+	 * XLogFlush is not restricted in the wal prohibited state.
+	 */
+	XLogFlush(GetXLogWriteRecPtr());
+
+	if (isReadOnlyRequest)
+	{
+		/* Should have set the final state where WAL is prohibited. */
+		Assert(CounterGetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_ONLY);
+
+		ereport(LOG, (errmsg("WAL is now prohibited")));
+	}
+	else
+	{
+		/*
+		 * Should have set the final state where WAL is no longer
+		 * prohibited.
+		 */
+		Assert(CounterGetWALProhibitState(wal_prohibit_counter) ==
+			   WALPROHIBIT_STATE_READ_WRITE);
+
+		ereport(LOG, (errmsg("WAL is no longer prohibited")));
+	}
+
+	/* Wake up the backend waiting on this. */
+	ConditionVariableBroadcast(&WALProhibit->wal_prohibit_cv);
+}
+
+/*
+ * GetWALProhibitCounter()
+ */
+static inline uint32
+GetWALProhibitCounter(void)
+{
+	return pg_atomic_read_u32(&WALProhibit->wal_prohibit_counter);
+}
+
+/*
+ * GetWALProhibitState()
+ */
+WALProhibitState
+GetWALProhibitState(void)
+{
+	return CounterGetWALProhibitState(GetWALProhibitCounter());
+}
+
+/*
+ * WALProhibitStateCounterInit()
+ *
+ * Initialization of shared wal prohibit state counter.
+ */
+void
+WALProhibitStateCounterInit(bool wal_prohibited)
+{
+	WALProhibitState new_state;
+
+	Assert(AmStartupProcess() || !IsPostmasterEnvironment);
+
+	new_state = wal_prohibited ?
+		WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
+
+	pg_atomic_init_u32(&WALProhibit->wal_prohibit_counter, (uint32) new_state);
+}
+
+/*
+ * WALProhibitStateShmemInit()
+ *
+ *	Initialization of shared memory for WAL prohibit state.
+ */
+void
+WALProhibitStateShmemInit(void)
+{
+	bool		found;
+
+	WALProhibit = (WALProhibitData *)
+		ShmemInitStruct("WAL Prohibit State",
+						sizeof(WALProhibitData),
+						&found);
+
+	if (!found)
+	{
+		/* First time through ... */
+		memset(WALProhibit, 0, sizeof(WALProhibitData));
+		ConditionVariableInit(&WALProhibit->wal_prohibit_cv);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bf2fc08d94..62d1018220 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2030,23 +2030,27 @@ StartTransaction(void)
 	Assert(s->prevSecContext == 0);
 
 	/*
-	 * Make sure we've reset xact state variables
+	 * Reset xact state variables.
 	 *
-	 * If recovery is still in progress, mark this transaction as read-only.
-	 * We have lower level defences in XLogInsert and elsewhere to stop us
-	 * from modifying data during recovery, but this gives the normal
-	 * indication to the user that the transaction is read-only.
-	 */
-	if (RecoveryInProgress())
-	{
-		s->startedInRecovery = true;
-		XactReadOnly = true;
-	}
-	else
-	{
-		s->startedInRecovery = false;
-		XactReadOnly = DefaultXactReadOnly;
-	}
+	 * If it is not currently possible to insert write-ahead log records, either
+	 * because we are still in recovery or because pg_prohibit_wal() function
+	 * has been executed, force this to be a read-only transaction.  We have
+	 * lower level defences in XLogBeginInsert() and elsewhere to stop us from
+	 * modifying data during recovery when !XLogInsertAllowed(), but this gives
+	 * the normal indication to the user that the transaction is read-only.
+	 *
+	 * On the other hand, we only need to set the startedInRecovery flag when
+	 * the transaction started during recovery, and not when WAL is otherwise
+	 * prohibited. This information is used by RelationGetIndexScan() to decide
+	 * whether to permit (1) relying on existing killed-tuple markings and (2)
+	 * further killing of index tuples. Even when WAL is prohibited on the
+	 * master, it's still the master, so the former is OK; and since killing
+	 * index tuples doesn't generate WAL, the latter is also OK.  See comments
+	 * in RelationGetIndexScan() and MarkBufferDirtyHint().
+	 */
+	XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
+	s->startedInRecovery = RecoveryInProgress();
+
 	XactDeferrable = DefaultXactDeferrable;
 	XactIsoLevel = DefaultXactIsoLevel;
 	forceSyncCommit = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19a499e4e6..5447dcb085 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -55,6 +55,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
@@ -226,9 +227,10 @@ static bool LocalRecoveryInProgress = true;
  *		0: unconditionally not allowed to insert XLOG
  *		-1: must check RecoveryInProgress(); disallow until it is false
  * Most processes start with -1 and transition to 1 after seeing that recovery
- * is not in progress.  But we can also force the value for special cases.
- * The coding in XLogInsertAllowed() depends on the first two of these states
- * being numerically the same as bool true and false.
+ * is not in progress or the server state is not a WAL prohibited state.  But
+ * we can also force the value for special cases.  The coding in
+ * XLogInsertAllowed() depends on the first two of these states being
+ * numerically the same as bool true and false.
  */
 static int	LocalXLogInsertAllowed = -1;
 
@@ -569,6 +571,12 @@ typedef struct XLogCtlData
 	XLogRecPtr	endOfLog;
 	TimeLineID	endOfLogTLI;
 
+	/*
+	 * SharedXLogAllowWritesState indicates the state of the last recovery
+	 * checkpoint and the WAL writes required to start the server normally.
+	 * Protected by info_lck.
+	 */
+	XLogAcceptWritesState SharedXLogAllowWritesState;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
@@ -4216,6 +4224,16 @@ UpdateControlFile(void)
 	update_controlfile(DataDir, ControlFile, true);
 }
 
+/* Set ControlFile's WAL prohibit flag */
+void
+SetControlFileWALProhibitFlag(bool walProhibited)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_prohibited = walProhibited;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Returns the unique system identifier from control file.
  */
@@ -4490,6 +4508,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_NONE;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -5588,8 +5607,29 @@ StartupXLOG(void)
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 
-	/* Prepare to accept WAL writes. */
-	promoted = XLogAcceptWrites();
+	/*
+	 * Before enabling WAL insertion, initialize the WAL prohibit state in
+	 * shared memory, which determines whether further WAL inserts are
+	 * allowed.
+	 */
+	WALProhibitStateCounterInit(ControlFile->wal_prohibited);
+
+	/*
+	 * Skip WAL writes and the end-of-recovery checkpoint if the system is in
+	 * the WAL prohibited state.
+	 */
+	if (IsWALProhibited())
+	{
+		XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DELAYED;
+
+		ereport(LOG,
+				(errmsg("skipping startup checkpoint because the WAL is now prohibited")));
+	}
+	else
+	{
+		/* Prepare to accept WAL writes. */
+		promoted = XLogAcceptWrites();
+	}
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -5658,6 +5698,28 @@ XLogAcceptWrites(void)
 	 */
 	volatile XLogCtlData *xlogctl = XLogCtl;
 
+	/*
+	 * Quick exit if the WAL writes required to start the server normally have
+	 * already been performed or are in progress.
+	 */
+	if (xlogctl->SharedXLogAllowWritesState == XLOG_ACCEPT_WRITES_DONE ||
+		xlogctl->SharedXLogAllowWritesState == XLOG_ACCEPT_WRITES_STARTED)
+		return promoted;
+
+	/*
+	 * Mark the shared memory state as started to prevent re-entrance.
+	 * Spinlock protection isn't needed since only one process updates this
+	 * value at a time.
+	 */
+	xlogctl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_STARTED;
+
+	/*
+	 * If the system is in the WAL prohibited state, only the checkpointer
+	 * process should be here, completing an operation that was skipped
+	 * earlier while booting the system in the WAL prohibited state.
+	 */
+	Assert(!IsWALProhibited() || AmCheckpointerProcess());
+
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
@@ -5699,9 +5761,34 @@ XLogAcceptWrites(void)
 	 */
 	CompleteCommitTsInitialization();
 
+	/*
+	 * Update completion status.
+	 */
+	XLogCtl->SharedXLogAllowWritesState = XLOG_ACCEPT_WRITES_DONE;
+
 	return promoted;
 }
 
+/*
+ * Wrapper function to call XLogAcceptWrites() for checkpointer process.
+ */
+void
+PerformPendingXLogAcceptWrites(void)
+{
+	/* Prepare to accept WAL writes. */
+	(void) XLogAcceptWrites();
+
+	/*
+	 * We need to update DBState explicitly, like the startup process does,
+	 * because the end-of-recovery checkpoint would have set the DB state to
+	 * shutdown.
+	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->state = DB_IN_PRODUCTION;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
 /*
  * Callback from PerformWalRecovery(), called when we switch from crash
  * recovery to archive recovery mode.  Updates the control file accordingly.
@@ -5815,6 +5902,11 @@ PerformRecoveryXLogAction(void)
 		 */
 		CreateEndOfRecoveryRecord();
 	}
+	else if (AmCheckpointerProcess())
+	{
+		/* In checkpointer process, just do it ourselves */
+		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+	}
 	else
 	{
 		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
@@ -5882,9 +5974,9 @@ GetRecoveryState(void)
 /*
  * Is this process allowed to insert new WAL records?
  *
- * Ordinarily this is essentially equivalent to !RecoveryInProgress().
- * But we also have provisions for forcing the result "true" or "false"
- * within specific processes regardless of the global state.
+ * Ordinarily this is essentially equivalent to !RecoveryInProgress() and
+ * !IsWALProhibited().  But we also have provisions for forcing the result
+ * "true" or "false" within specific processes regardless of the global state.
  */
 bool
 XLogInsertAllowed(void)
@@ -5903,9 +5995,20 @@ XLogInsertAllowed(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/* Or, in WAL prohibited state */
+	if (IsWALProhibited())
+	{
+		/*
+		 * Set it to "unconditionally false" to avoid checking until it gets
+		 * reset.
+		 */
+		LocalXLogInsertAllowed = 0;
+		return false;
+	}
+
 	/*
-	 * On exit from recovery, reset to "unconditionally true", since there is
-	 * no need to keep checking.
+	 * On exit from recovery or WAL prohibited state, reset to
+	 * "unconditionally true", since there is no need to keep checking.
 	 */
 	LocalXLogInsertAllowed = 1;
 	return true;
@@ -5929,6 +6032,12 @@ LocalSetXLogInsertAllowed(void)
 	return oldXLogAllowed;
 }
 
+void
+ResetLocalXLogInsertAllowed(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
 /*
  * Return the current Redo pointer from shared memory.
  *
@@ -6079,6 +6188,16 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 	return result;
 }
 
+/*
+ * Fetch the latest state of whether WAL writes are allowed.
+ */
+XLogAcceptWritesState
+GetXLogWriteAllowedState(void)
+{
+	/* Since the value can't be changing concurrently, no lock is required. */
+	return ((volatile XLogCtlData *) XLogCtl)->SharedXLogAllowWritesState;
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
@@ -6109,9 +6228,13 @@ ShutdownXLOG(int code, Datum arg)
 	 */
 	WalSndWaitStopping();
 
+	/*
+	 * The restartpoint, checkpoint, or xlog rotation below is performed only
+	 * if WAL writing is permitted.
+	 */
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
-	else
+	else if (XLogInsertAllowed())
 	{
 		/*
 		 * If archiving is enabled, rotate the last XLOG file so that all the
@@ -6124,6 +6247,9 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+	else
+		ereport(LOG,
+				(errmsg("skipping shutdown checkpoint because the WAL is now prohibited")));
 }
 
 /*
@@ -6374,8 +6500,13 @@ CreateCheckPoint(int flags)
 		shutdown = false;
 
 	/* sanity check */
-	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
-		elog(ERROR, "can't create a checkpoint during recovery");
+	if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+	{
+		if (RecoveryInProgress())
+			elog(ERROR, "can't create a checkpoint during recovery");
+		else if (!XLogInsertAllowed())
+			elog(ERROR, "can't create a checkpoint while WAL is prohibited");
+	}
 
 	/*
 	 * Prepare to accumulate statistics.
@@ -6846,10 +6977,11 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn, XLogRecPtr pagePtr,
 	XLogRecPtr	recptr;
 	XLogPageHeader pagehdr;
 	XLogRecPtr	startPos;
+	XLogAcceptWritesState state = GetXLogWriteAllowedState();
 
 	/* sanity checks */
-	if (!RecoveryInProgress())
-		elog(ERROR, "can only be used at end of recovery");
+	if (state != XLOG_ACCEPT_WRITES_STARTED)
+		elog(ERROR, "can only be used when enabling WAL writes");
 	if (pagePtr % XLOG_BLCKSZ != 0)
 		elog(ERROR, "invalid position for missing continuation record %X/%X",
 			 LSN_FORMAT_ARGS(pagePtr));
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 99cad8bcf9..e120970446 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -558,13 +558,27 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 					(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery has been requested, the system must not
+		 * remain in a WAL prohibited state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+
 	private = palloc0(sizeof(XLogPageReadPrivate));
 	xlogreader =
 		XLogReaderAllocate(wal_segment_size, NULL,
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 73da687d5d..ad8769a534 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -709,6 +709,8 @@ REVOKE EXECUTE ON FUNCTION pg_ls_logicalmapdir() FROM PUBLIC;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_replslotdir(text) FROM PUBLIC;
 
+REVOKE EXECUTE ON FUNCTION pg_prohibit_wal(bool) FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index e5ddcda0b4..a446d9c13a 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -508,6 +508,13 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
 			GUC_check_errmsg("cannot set transaction read-write mode during recovery");
 			return false;
 		}
+		/* Can't go to r/w mode while WAL is prohibited */
+		if (!XLogInsertAllowed())
+		{
+			GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
+			GUC_check_errmsg("cannot set transaction read-write mode while WAL is prohibited");
+			return false;
+		}
 	}
 
 	return true;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f36c40e852..1f0398e1f1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -694,10 +694,12 @@ AutoVacLauncherMain(int argc, char *argv[])
 
 		/*
 		 * There are some conditions that we need to check before trying to
-		 * start a worker.  First, we need to make sure that there is a worker
-		 * slot available.  Second, we need to make sure that no other worker
-		 * failed while starting up.
+		 * start a worker.  First, WAL writes must be permitted.  Second, we
+		 * need to make sure that there is a worker slot available.  Third, we
+		 * need to make sure that no other worker failed while starting up.
 		 */
+		if (!XLogInsertAllowed())
+			continue;
 
 		current_time = GetCurrentTimestamp();
 		LWLockAcquire(AutovacuumLock, LW_SHARED);
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 91e6f6ea18..ecbcac18c5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -273,7 +273,7 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if (XLogStandbyInfoActive() && XLogInsertAllowed())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c937c39f50..09dd645bca 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -36,6 +36,7 @@
 #include <sys/time.h>
 #include <time.h>
 
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogrecovery.h"
@@ -350,6 +351,7 @@ CheckpointerMain(void)
 		pg_time_t	now;
 		int			elapsed_secs;
 		int			cur_timeout;
+		WALProhibitState cur_state;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -359,6 +361,22 @@ CheckpointerMain(void)
 		 */
 		AbsorbSyncRequests();
 		HandleCheckpointerInterrupts();
+		ProcessWALProhibitStateChangeRequest();
+
+		/* Should be in WAL permitted state to perform the checkpoint */
+		cur_state = GetWALProhibitState();
+		if (cur_state != WALPROHIBIT_STATE_READ_WRITE)
+		{
+			 * Don't let the checkpointer process do anything until someone
+			 * wakes it up.  For example, a backend might later request that
+			 * we put the system back into the read-write state.
+			 * system back to read-write state.
+			 */
+			if (cur_state == WALPROHIBIT_STATE_READ_ONLY)
+				(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+								 -1, WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
+			continue;
+		}
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -702,6 +720,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!AmCheckpointerProcess())
 		return;
 
+	/* Check for wal prohibit state change request */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind schedule,
 	 * in which case we just try to catch up as quickly as possible.
@@ -1352,3 +1373,19 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * SendSignalToCheckpointer allows any process to send a signal to the
+ * checkpointer process.
+ */
+bool
+SendSignalToCheckpointer(int signum)
+{
+	if (CheckpointerShmem->checkpointer_pid == 0)
+		return false;
+
+	if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
+		return false;
+
+	return true; /* Signaled checkpointer successfully */
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 75e456360b..68f434ef74 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
@@ -254,6 +255,12 @@ CreateSharedMemoryAndSemaphores(void)
 	MultiXactShmemInit();
 	InitBufferPool();
 
+	/*
+	 * Set up the shared memory structure needed to handle concurrent WAL
+	 * prohibit state change requests.
+	 */
+	WALProhibitStateShmemInit();
+
 	/*
 	 * Set up lock manager
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index f41563a0a4..471ed11128 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 
 #include "access/parallel.h"
+#include "access/walprohibit.h"
 #include "port/pg_bitutils.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -539,6 +540,9 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_WALPROHIBIT:
+						processed = ProcessBarrierWALProhibit();
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index ee2e15c17e..36a08ebf30 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,15 +794,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);
 
-	if (RecoveryInProgress() && !InRecovery &&
+	if (!XLogInsertAllowed() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
 		lockmode > RowExclusiveLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
+				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress or when WAL is prohibited",
 						lockMethodTable->lockModeNames[lockmode]),
-				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery or when WAL is prohibited.")));
 
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index c695d816fc..1512150e4f 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -21,6 +21,7 @@
 #include "access/commit_ts.h"
 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -255,9 +256,17 @@ SyncPostCheckpoint(void)
 		entry->canceled = true;
 
 		/*
-		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
-		 * requests for a long time when there are many deletions to be done.
-		 * We can safely call AbsorbSyncRequests() at this point in the loop.
+		 * As in ProcessSyncRequests, we don't want to stop processing WAL
+		 * prohibit state change requests for a long time when there are many
+		 * deletions to be done.  They need to be checked and processed by the
+		 * checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
+		/*
+		 * Similarly, we don't want to stop absorbing fsync requests for a
+		 * long time.  We can safely call AbsorbSyncRequests() at this point in
+		 * the loop (note it might try to delete list entries).
 		 */
 		if (--absorb_counter <= 0)
 		{
@@ -315,6 +324,9 @@ ProcessSyncRequests(void)
 	if (!pendingOps)
 		elog(ERROR, "cannot sync without a pendingOps table");
 
+	/* Check for wal prohibit state change request for checkpointer */
+	ProcessWALProhibitStateChangeRequest();
+
 	/*
 	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.  The tightest
@@ -373,6 +385,13 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
+		 * We don't want to stop processing WAL prohibit state change requests
+		 * for a long time when there are many fsync requests to process.  They
+		 * need to be checked and processed by the checkpointer soon.
+		 * be check and processed by checkpointer as soon as possible.
+		 */
+		ProcessWALProhibitStateChangeRequest();
+
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -459,6 +478,12 @@ ProcessSyncRequests(void)
 							 errmsg_internal("could not fsync file \"%s\" but retrying: %m",
 											 path)));
 
+				/*
+				 * Check for WAL prohibit state change requests here for the
+				 * same reason as the earlier calls above.
+				 */
+				ProcessWALProhibitStateChangeRequest();
+
 				/*
 				 * Absorb incoming requests and check to see if a cancel
 				 * arrived for this relation fork.
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f364a9b88a..b6f96a9cb9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 87c15b9c6f..aa5330eb63 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -741,6 +741,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_WALPROHIBIT_STATE_CHANGE:
+			event_name = "SystemWALProhibitStateChange";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 22b5571a70..72e9990ca9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -39,6 +39,7 @@
 #include "access/toast_compression.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogprefetcher.h"
@@ -242,6 +243,7 @@ static bool check_recovery_target_lsn(char **newval, void **extra, GucSource sou
 static void assign_recovery_target_lsn(const char *newval, void *extra);
 static bool check_primary_slot_name(char **newval, void **extra, GucSource source);
 static bool check_default_with_oids(bool *newval, void **extra, GucSource source);
+static const char *show_wal_prohibited(void);
 
 /* Private functions in guc-file.l that need to be called from guc.c */
 static ConfigVariable *ProcessConfigFileInternal(GucContext context,
@@ -716,6 +718,7 @@ static char *recovery_target_string;
 static char *recovery_target_xid_string;
 static char *recovery_target_name_string;
 static char *recovery_target_lsn_string;
+static bool wal_prohibited;
 
 
 /* should be static, but commands/variable.c needs to get at this */
@@ -2182,6 +2185,18 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* Not for general use */
+		{"wal_prohibited", PGC_INTERNAL, WAL_SETTINGS,
+			gettext_noop("Shows whether the WAL is prohibited."),
+			NULL,
+			GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&wal_prohibited,
+		false,
+		NULL, NULL, show_wal_prohibited
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -12891,4 +12906,16 @@ check_default_with_oids(bool *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * NB: The return string should be the same as what _ShowOption() produces
+ * for a boolean type.
+ */
+static const char *
+show_wal_prohibited(void)
+{
+	if (IsWALProhibited())
+		return "on";
+	return "off";
+}
+
 #include "guc-file.c"
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..e4d99a50c0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("WAL write prohibited:                 %s\n"),
+		   ControlFile->wal_prohibited ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
new file mode 100644
index 0000000000..d71522cbf3
--- /dev/null
+++ b/src/include/access/walprohibit.h
@@ -0,0 +1,60 @@
+/*
+ * walprohibit.h
+ *
+ * PostgreSQL write-ahead log prohibit states
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * src/include/access/walprohibit.h
+ */
+
+#ifndef WALPROHIBIT_H
+#define WALPROHIBIT_H
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+
+/*
+ * WAL Prohibit States.
+ *
+ * There are four possible WAL states.  A brand new database cluster is always
+ * initially WALPROHIBIT_STATE_READ_WRITE.  If the user tries to make it read
+ * only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY.  When the
+ * transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY.  If
+ * the user subsequently tries to make it read write, we will enter the state
+ * WALPROHIBIT_STATE_GOING_READ_WRITE.  When that transition is complete, we
+ * will enter the state WALPROHIBIT_STATE_READ_WRITE.  These four state
+ * transitions are the only ones possible; for example, if we're currently in
+ * state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will
+ * produce an error, and a second attempt to go read-only will not cause a state
+ * change.  Thus, we can represent the state as a shared-memory counter whose
+ * value only ever changes by adding 1.  The initial value at postmaster startup
+ * is either 0 or 2, depending on whether the control file specifies the system
+ * is starting read-write or read-only.
+ */
+typedef enum
+{
+	WALPROHIBIT_STATE_READ_WRITE = 0,		/* WAL permitted */
+	WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
+	WALPROHIBIT_STATE_READ_ONLY = 2,		/* WAL prohibited */
+	WALPROHIBIT_STATE_GOING_READ_WRITE = 3
+} WALProhibitState;
+
+static inline WALProhibitState
+CounterGetWALProhibitState(uint32 wal_prohibit_counter)
+{
+	/* Extract last two bits */
+	return (WALProhibitState) (wal_prohibit_counter & 3);
+}
+
+extern bool ProcessBarrierWALProhibit(void);
+extern void MarkCheckPointSkippedInWalProhibitState(void);
+extern void WALProhibitStateCounterInit(bool wal_prohibited);
+extern void WALProhibitStateShmemInit(void);
+extern bool IsWALProhibited(void);
+extern void ProcessWALProhibitStateChangeRequest(void);
+extern WALProhibitState GetWALProhibitState(void);
+
+#endif							/* WALPROHIBIT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5e1e3446ae..19bf3171bd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -80,6 +80,15 @@ typedef enum WalCompression
 	WAL_COMPRESSION_ZSTD
 } WalCompression;
 
+/* State of XLogAcceptWrites() execution */
+typedef enum XLogAcceptWritesState
+{
+	XLOG_ACCEPT_WRITES_NONE = 0,	/* initial state, not started */
+	XLOG_ACCEPT_WRITES_DELAYED,		/* skipped XLogAcceptWrites() for now */
+	XLOG_ACCEPT_WRITES_STARTED,		/* inside XLogAcceptWrites() */
+	XLOG_ACCEPT_WRITES_DONE			/* done with XLogAcceptWrites() */
+} XLogAcceptWritesState;
+
 /* Recovery states */
 typedef enum RecoveryState
 {
@@ -217,6 +226,7 @@ extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool XLogInsertAllowed(void);
+extern void ResetLocalXLogInsertAllowed(void);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
 
@@ -230,6 +240,7 @@ extern void BootStrapXLOG(void);
 extern void LocalProcessControlFile(bool reset);
 extern void StartupXLOG(void);
 extern void ShutdownXLOG(int code, Datum arg);
+extern void PerformPendingXLogAcceptWrites(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
 extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
@@ -243,8 +254,10 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern XLogAcceptWritesState GetXLogWriteAllowedState(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
+extern void SetControlFileWALProhibitFlag(bool wal_prohibited);
 
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06368e2366..7078245b64 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,9 @@ typedef struct ControlFileData
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
+	/* Indicates whether WAL inserts are prohibited. */
+	bool		wal_prohibited;
+
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6d378ff785..67dcb8af87 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11838,6 +11838,10 @@
   proname => 'pg_partition_root', prorettype => 'regclass',
   proargtypes => 'regclass', prosrc => 'pg_partition_root' },
 
+{ oid => '4549', descr => 'change server to permit or prohibit wal writes',
+  proname => 'pg_prohibit_wal', prorettype => 'bool',
+  proargtypes => 'bool', prosrc => 'pg_prohibit_wal' },
+
 { oid => '4350', descr => 'Unicode normalization',
   proname => 'normalize', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'unicode_normalize_func' },
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2882efd67b..738ea5b0bb 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern bool SendSignalToCheckpointer(int signum);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index ee636900f3..2ab644e3e0 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -49,7 +49,8 @@ typedef enum
 
 typedef enum
 {
-	PROCSIGNAL_BARRIER_SMGRRELEASE	/* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SMGRRELEASE,	/* ask smgr to close files */
+	PROCSIGNAL_BARRIER_WALPROHIBIT
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index b578e2ec75..552a5d0e2d 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -229,7 +229,8 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_WALPROHIBIT_STATE_CHANGE
 } WaitEventIO;
 
 
diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out
index 3de6404ba5..a9eed063d5 100644
--- a/src/test/regress/expected/guc.out
+++ b/src/test/regress/expected/guc.out
@@ -896,7 +896,8 @@ SELECT name FROM tab_settings_flags
  transaction_deferrable
  transaction_isolation
  transaction_read_only
-(3 rows)
+ wal_prohibited
+(4 rows)
 
 -- NO_SHOW_ALL implies NOT_IN_SAMPLE.
 SELECT name FROM tab_settings_flags
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index be3fafadf8..9e02d0507c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2854,6 +2854,7 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALProhibitData
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.18.0

v45-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch (application/octet-stream)
From 3ddcc088e4f8ba0f0f2abc080f263707bd46e4b1 Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 27 Jul 2020 02:13:36 -0400
Subject: [PATCH v45 4/6] Error or Assert before START_CRIT_SECTION for WAL
 write

Add an Assert or an ERROR before WAL-writing critical sections when the system
is WAL-prohibited, based on the following criteria:

 - Use an ERROR for functions that can be reached without a valid XID, e.g. in
   VACUUM or CREATE INDEX CONCURRENTLY.  For that, a common static inline
   function CheckWALPermitted() is added.
 - Use an Assert for functions that cannot be reached without a valid XID; the
   Assert also checks that a valid XID has been assigned.  For that,
   AssertWALPermitted_HaveXID() is added.

To enforce the rule that one of these checks precedes every critical section
that writes WAL, a new assert-only flag, walpermit_checked_state, is added.
If the check is missing, XLogBeginInsert() will trip an assertion when it is
called inside a critical section.

If the WAL insert is not done inside a critical section, the above check is
not necessary; we can rely on XLogBeginInsert() to perform the check and
report an error.
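
To make the coding rule concrete, here is a rough sketch of what a call site
following it might look like.  This is illustrative only, not code from the
patch: rel, buf, and recptr come from the surrounding (hypothetical) function,
and RM_FOO_ID / XLOG_FOO_OP are placeholder resource-manager constants.

    XLogRecPtr  recptr;

    /* Check WAL permission before entering the critical section. */
    if (RelationNeedsWAL(rel))
        CheckWALPermitted();    /* ERRORs out if WAL is prohibited */

    START_CRIT_SECTION();

    /* ... apply the page modifications ... */
    MarkBufferDirty(buf);

    if (RelationNeedsWAL(rel))
    {
        /* Safe: WAL permission was checked before the critical section. */
        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);    /* placeholder rmgr/opcode */
        PageSetLSN(BufferGetPage(buf), recptr);
    }

    END_CRIT_SECTION();

This mirrors the pg_surgery change below, where CheckWALPermitted() is called
only when the relation needs WAL, right before START_CRIT_SECTION().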
---
 contrib/pg_surgery/heap_surgery.c         | 10 ++++--
 src/backend/access/brin/brin.c            |  4 +++
 src/backend/access/brin/brin_pageops.c    | 21 +++++++++--
 src/backend/access/brin/brin_revmap.c     | 10 +++++-
 src/backend/access/gin/ginbtree.c         | 15 ++++++--
 src/backend/access/gin/gindatapage.c      | 18 ++++++++--
 src/backend/access/gin/ginfast.c          | 11 ++++--
 src/backend/access/gin/gininsert.c        |  4 +++
 src/backend/access/gin/ginutil.c          |  9 ++++-
 src/backend/access/gin/ginvacuum.c        | 11 +++++-
 src/backend/access/gist/gist.c            | 25 ++++++++++---
 src/backend/access/gist/gistvacuum.c      | 13 +++++--
 src/backend/access/hash/hash.c            | 19 ++++++++--
 src/backend/access/hash/hashinsert.c      |  9 ++++-
 src/backend/access/hash/hashovfl.c        | 22 +++++++++---
 src/backend/access/hash/hashpage.c        |  9 +++++
 src/backend/access/heap/heapam.c          | 26 +++++++++++++-
 src/backend/access/heap/pruneheap.c       | 16 ++++++---
 src/backend/access/heap/vacuumlazy.c      | 22 +++++++++---
 src/backend/access/heap/visibilitymap.c   | 19 ++++++++--
 src/backend/access/nbtree/nbtdedup.c      |  3 ++
 src/backend/access/nbtree/nbtinsert.c     | 24 ++++++++++---
 src/backend/access/nbtree/nbtpage.c       | 34 +++++++++++++++---
 src/backend/access/spgist/spgdoinsert.c   | 13 +++++++
 src/backend/access/spgist/spgvacuum.c     | 22 ++++++++++--
 src/backend/access/transam/multixact.c    |  5 ++-
 src/backend/access/transam/twophase.c     |  9 +++++
 src/backend/access/transam/varsup.c       |  4 +++
 src/backend/access/transam/walprohibit.c  | 10 ++++++
 src/backend/access/transam/xact.c         |  6 ++++
 src/backend/access/transam/xlog.c         | 30 +++++++++++-----
 src/backend/access/transam/xloginsert.c   | 21 +++++++++--
 src/backend/commands/dbcommands.c         |  3 ++
 src/backend/commands/sequence.c           | 26 +++++++++++---
 src/backend/postmaster/checkpointer.c     |  4 +++
 src/backend/storage/buffer/bufmgr.c       | 13 ++++---
 src/backend/storage/freespace/freespace.c | 10 +++++-
 src/backend/utils/cache/relmapper.c       |  3 ++
 src/include/access/walprohibit.h          | 44 +++++++++++++++++++++++
 src/include/miscadmin.h                   | 26 ++++++++++++++
 40 files changed, 528 insertions(+), 75 deletions(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 3e641aa644..9952f01a71 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -14,6 +14,7 @@
 
 #include "access/heapam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_proc_d.h"
@@ -89,6 +90,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	OffsetNumber curr_start_ptr,
 				next_start_ptr;
 	bool		include_this_tid[MaxHeapTuplesPerPage];
+	bool		needwal;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -100,6 +102,7 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	sanity_check_tid_array(ta, &ntids);
 
 	rel = relation_open(relid, RowExclusiveLock);
+	needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Check target relation.
@@ -235,6 +238,9 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
 			visibilitymap_pin(rel, blkno, &vmbuf);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) from here until all the changes are logged. */
 		START_CRIT_SECTION();
 
@@ -315,12 +321,12 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				log_newpage_buffer(buf, true);
 		}
 
 		/* WAL log the VM page if it was modified. */
-		if (did_modify_vm && RelationNeedsWAL(rel))
+		if (did_modify_vm && needwal)
 			log_newpage_buffer(vmbuf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4366010768..c725db5a32 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -24,6 +24,7 @@
 #include "access/relscan.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
@@ -901,6 +902,9 @@ brinbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 17257919db..3c8443d3c2 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -14,6 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_revmap.h"
 #include "access/brin_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -65,6 +66,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		newbuf;
 	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
+	bool		needwal = RelationNeedsWAL(idxrel);
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -176,13 +178,16 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 		if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) unconstify(BrinTuple *, newtup), newsz))
 			elog(ERROR, "failed to replace BRIN tuple");
 		MarkBufferDirty(oldbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
@@ -240,6 +245,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		/*
@@ -267,7 +275,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(revmapbuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(idxrel))
+		if (needwal)
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
@@ -351,6 +359,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	Buffer		revmapbuf;
 	ItemPointerData tid;
 	bool		extended;
+	bool		needwal;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -405,6 +414,10 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	page = BufferGetPage(*buffer);
 	blk = BufferGetBlockNumber(*buffer);
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Execute the actual insertion */
 	START_CRIT_SECTION();
 	if (extended)
@@ -424,7 +437,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	MarkBufferDirty(revmapbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
@@ -881,6 +894,8 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 			   "brin_initialize_empty_new_buffer: initializing blank page %u",
 			   BufferGetBlockNumber(buffer)));
 
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 6e392a551a..4dbc27ca9e 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -26,6 +26,7 @@
 #include "access/brin_tuple.h"
 #include "access/brin_xlog.h"
 #include "access/rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -340,6 +341,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
 	ItemId		lp;
+	bool		needwal;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
@@ -405,6 +407,10 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	 * crashed or aborted summarization; remove them silently.
 	 */
 
+	needwal = RelationNeedsWAL(idxrel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	ItemPointerSetInvalid(&invalidIptr);
@@ -416,7 +422,7 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	MarkBufferDirty(regBuf);
 	MarkBufferDirty(revmapBuf);
 
-	if (RelationNeedsWAL(idxrel))
+	if (needwal)
 	{
 		xl_brin_desummarize xlrec;
 		XLogRecPtr	recptr;
@@ -614,6 +620,8 @@ revmap_physical_extend(BrinRevmap *revmap)
 		return;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Ok, we have now locked the metapage and the target block. Re-initialize
 	 * the target block as a revmap page, and update the metapage.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 8df45478f1..13ee954cc0 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/predicate.h"
@@ -332,6 +333,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	bool		result;
+	bool		needwal;
 	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
@@ -377,6 +379,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 								 insertdata, updateblkno,
 								 &ptp_workspace,
 								 &newlpage, &newrpage);
+	needwal = RelationNeedsWAL(btree->index) && !btree->isBuild;
 
 	if (rc == GPTP_NO_WORK)
 	{
@@ -385,10 +388,13 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}
 	else if (rc == GPTP_INSERT)
 	{
+		if (needwal)
+			CheckWALPermitted();
+
 		/* It will fit, perform the insertion */
 		START_CRIT_SECTION();
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
@@ -409,7 +415,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			MarkBufferDirty(childbuf);
 		}
 
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 			ginxlogInsert xlrec;
@@ -547,6 +553,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * OK, we have the new contents of the left page in a temporary copy
 		 * now (newlpage), and likewise for the new contents of the
@@ -587,7 +596,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		}
 
 		/* write WAL record */
-		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 7c76d1f90d..c0d9fe3069 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
@@ -811,6 +812,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	if (removedsomething)
 	{
 		bool		modified;
+		bool		needwal;
 
 		/*
 		 * Make sure we have a palloc'd copy of all segments, after the first
@@ -835,8 +837,12 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 			}
 		}
 
-		if (RelationNeedsWAL(indexrel))
+		needwal = RelationNeedsWAL(indexrel);
+		if (needwal)
+		{
 			computeLeafRecompressWALData(leaf);
+			CheckWALPermitted();
+		}
 
 		/* Apply changes to page */
 		START_CRIT_SECTION();
@@ -845,7 +851,7 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(indexrel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -1777,6 +1783,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	int			nrootitems;
 	int			rootsize;
 	bool		is_build = (buildStats != NULL);
+	bool		needwal;
 
 	/* Construct the new root page in memory first. */
 	tmppage = (Page) palloc(BLCKSZ);
@@ -1825,12 +1832,17 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	 */
 	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogCreatePostingTree data;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 7409fdc165..6cb788ff8d 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -20,6 +20,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_am.h"
@@ -68,6 +69,8 @@ writeListPage(Relation index, Buffer buffer,
 	PGAlignedBlock workspace;
 	char	   *ptr;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	GinInitBuffer(buffer, GIN_LIST);
@@ -548,6 +551,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 	Page		metapage;
 	GinMetaPageData *metadata;
 	BlockNumber blknoToDelete;
+	bool		needwal = RelationNeedsWAL(index);
 
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
@@ -586,8 +590,11 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		 * prepare the XLogInsert machinery for that before entering the
 		 * critical section.
 		 */
-		if (RelationNeedsWAL(index))
+		if (needwal)
+		{
 			XLogEnsureRecordSpace(data.ndeleted, 0);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -625,7 +632,7 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 			MarkBufferDirty(buffers[i]);
 		}
 
-		if (RelationNeedsWAL(index))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index ea1c4184fb..35cdef3201 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
@@ -447,6 +448,9 @@ ginbuildempty(Relation index)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3d15701a01..a652509ca5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/reloptions.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
@@ -659,12 +660,18 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 	Buffer		metabuffer;
 	Page		metapage;
 	GinMetaPageData *metadata;
+	bool		needwal;
 
 	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
 	LockBuffer(metabuffer, GIN_EXCLUSIVE);
 	metapage = BufferGetPage(metabuffer);
 	metadata = GinPageGetMeta(metapage);
 
+	needwal = RelationNeedsWAL(index) && !is_build;
+
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	metadata->nTotalPages = stats->nTotalPages;
@@ -684,7 +691,7 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	MarkBufferDirty(metabuffer);
 
-	if (RelationNeedsWAL(index) && !is_build)
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogUpdateMeta data;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index b4fa5f6bf8..4299ed955a 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -136,6 +137,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	bool		needwal;
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -159,6 +161,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 */
 	PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
 
+	needwal = RelationNeedsWAL(gvs->index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* Unlink the page by changing left sibling's rightlink */
@@ -195,7 +201,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	MarkBufferDirty(lBuffer);
 	MarkBufferDirty(dBuffer);
 
-	if (RelationNeedsWAL(gvs->index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 		ginxlogDeletePage data;
@@ -651,6 +657,9 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 		if (resPage)
 		{
+			if (RelationNeedsWAL(gvs.index))
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 			PageRestoreTempPage(resPage, page);
 			MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8c6c744ab7..3982089d68 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include "access/gist_private.h"
 #include "access/gistscan.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "commands/vacuum.h"
@@ -137,6 +138,9 @@ gistbuildempty(Relation index)
 	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Index building transactions will always have a valid XID */
+	AssertWALPermittedHaveXID();
+
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
@@ -237,6 +241,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -471,8 +476,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * insertion for that. NB: The number of pages and data segments
 		 * specified here must match the calculations in gistXLogSplit()!
 		 */
-		if (!is_build && RelationNeedsWAL(rel))
+		if (!is_build && needwal)
+		{
 			XLogEnsureRecordSpace(npage, 1 + npage * 2);
+			CheckWALPermitted();
+		}
 
 		START_CRIT_SECTION();
 
@@ -506,7 +514,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 				recptr = gistXLogSplit(is_leaf,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
@@ -532,6 +540,9 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	}
 	else
 	{
+		if (!is_build && needwal)
+			CheckWALPermitted();
+
 		/*
 		 * Enough space.  We always get here if ntup==0.
 		 */
@@ -573,7 +584,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			recptr = GistBuildLSN;
 		else
 		{
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				OffsetNumber ndeloffs = 0,
 							deloffs[1];
@@ -1647,6 +1658,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				maxoff;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	Assert(GistPageIsLeaf(page));
 
@@ -1669,11 +1681,14 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	{
 		TransactionId latestRemovedXid = InvalidTransactionId;
 
-		if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+		if (XLogStandbyInfoActive() && needwal)
 			latestRemovedXid =
 				index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
 													 deletable, ndeletable);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		PageIndexMultiDelete(page, deletable, ndeletable);
@@ -1690,7 +1705,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index f190decdff..ffed0e75e8 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -278,6 +279,7 @@ gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	Buffer		buffer;
 	Page		page;
 	BlockNumber recurse_to;
+	bool		needwal = RelationNeedsWAL(rel);
 
 restart:
 	recurse_to = InvalidBlockNumber;
@@ -359,6 +361,9 @@ restart:
 		 */
 		if (ntodelete > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
@@ -366,7 +371,7 @@ restart:
 			PageIndexMultiDelete(page, todelete, ntodelete);
 			GistMarkTuplesDeleted(page);
 
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				XLogRecPtr	recptr;
 
@@ -595,6 +600,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
 	FullTransactionId txid;
+	bool		needwal = RelationNeedsWAL(info->index);
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -649,6 +655,9 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	txid = ReadNextFullTransactionId();
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
@@ -661,7 +670,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MarkBufferDirty(parentBuffer);
 	PageIndexTupleDelete(parentPage, downlink);
 
-	if (RelationNeedsWAL(info->index))
+	if (needwal)
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
 		recptr = gistGetFakeLSN(info->index);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fd1a7119b6..2fa9f45105 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -22,6 +22,7 @@
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
@@ -470,6 +471,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	bool		needwal;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -576,6 +578,10 @@ loop_top:
 		goto loop_top;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
@@ -606,7 +612,7 @@ loop_top:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_update_meta_page xlrec;
 		XLogRecPtr	recptr;
@@ -693,6 +699,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	bool		needwal = RelationNeedsWAL(rel);
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -791,6 +798,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			if (needwal)
+				CheckWALPermitted();
+
 			/* No ereport(ERROR) until changes are logged */
 			START_CRIT_SECTION();
 
@@ -812,7 +822,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (needwal)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -886,6 +896,9 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = HashPageGetOpaque(page);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -893,7 +906,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 4f2fecb908..e1be1761e8 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "storage/buf_internals.h"
@@ -194,6 +195,8 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	AssertWALPermittedHaveXID();
+
 	/* Do the update.  No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -361,6 +364,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	if (ndeletable > 0)
 	{
 		TransactionId latestRemovedXid;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
@@ -371,6 +375,9 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		 */
 		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -394,7 +401,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_hash_vacuum_one_page xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e34cfc302d..1cb73610fc 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -19,6 +19,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
@@ -313,6 +314,8 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 found:
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
 	 * log the changes for bitmap page and overflow page together to avoid
@@ -511,6 +514,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
 	bool		update_metap = false;
+	bool		needwal;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -574,9 +578,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	needwal = RelationNeedsWAL(rel);
+
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
+	{
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+		CheckWALPermitted();
+	}
 
 	START_CRIT_SECTION();
 
@@ -642,7 +651,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -923,14 +932,19 @@ readpage:
 
 				if (nitups > 0)
 				{
+					bool		needwal = RelationNeedsWAL(rel);
+
 					Assert(nitups == ndeletable);
 
 					/*
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
+					{
 						XLogEnsureRecordSpace(0, 3 + nitups);
+						CheckWALPermitted();
+					}
 
 					START_CRIT_SECTION();
 
@@ -948,7 +962,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (needwal)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 39206d1942..9521585706 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -30,6 +30,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -817,6 +818,8 @@ restart_expand:
 		goto fail;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
@@ -1173,6 +1176,8 @@ _hash_splitbucket(Relation rel,
 
 				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
+					AssertWALPermittedHaveXID();
+
 					/*
 					 * Change the shared buffer state in critical section,
 					 * otherwise any error could make it unrecoverable.
@@ -1224,6 +1229,8 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			AssertWALPermittedHaveXID();
+
 			/*
 			 * Change the shared buffer state in critical section, otherwise
 			 * any error could make it unrecoverable.
@@ -1270,6 +1277,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = HashPageGetOpaque(npage);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1ee985f633..d7d8015e87 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/valid.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -2059,6 +2060,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2343,6 +2346,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
 			all_frozen_set = true;
 
+		AssertWALPermittedHaveXID();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -2903,6 +2908,8 @@ l1:
 							  xid, LockTupleExclusive, true,
 							  &new_xmax, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -3652,6 +3659,8 @@ l2:
 
 		Assert(HEAP_XMAX_IS_LOCKED_ONLY(infomask_lock_old_tuple));
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Clear obsolete visibility flags ... */
@@ -3837,6 +3846,8 @@ l2:
 										   id_has_external,
 										   &old_key_copied);
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -4818,6 +4829,8 @@ failed:
 							  GetCurrentTransactionId(), mode, false,
 							  &xid, &new_infomask, &new_infomask2);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -5608,6 +5621,8 @@ l4:
 								VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* ... and set them */
@@ -5766,6 +5781,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	StaticAssertStmt(MaxOffsetNumber < SpecTokenOffsetNumber,
 					 "invalid speculative token constant");
 
+	AssertWALPermittedHaveXID();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -5874,6 +5891,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 		elog(ERROR, "attempted to kill a non-speculative tuple");
 	Assert(!HeapTupleHeaderIsHeapOnly(tp.t_data));
 
+	AssertWALPermittedHaveXID();
+
 	/*
 	 * No need to check for serializable conflicts here.  There is never a
 	 * need for a combo CID, either.  No need to extract replica identity, or
@@ -5994,6 +6013,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	bool		needwal;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6024,6 +6044,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -6034,7 +6058,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (needwal)
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 98d31de003..cfa3976717 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -18,6 +18,7 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
@@ -115,11 +116,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	Size		minfree;
 
 	/*
-	 * We can't write WAL in recovery mode, so there's no point trying to
-	 * clean the page. The primary will likely issue a cleaning WAL record
-	 * soon anyway, so this is no particular loss.
+	 * We can't write if WAL is prohibited or recovery is in progress, so
+	 * there's no point trying to clean the page. The primary will likely issue
+	 * a cleaning WAL record soon anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 		return;
 
 	/*
@@ -277,6 +278,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				maxoff;
 	PruneState	prstate;
 	HeapTupleData tup;
+	bool		needwal;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -380,6 +382,10 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	needwal = RelationNeedsWAL(relation);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -413,7 +419,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/*
 		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (needwal)
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e1cac74e62..ca6bf27539 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -42,6 +42,7 @@
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1480,6 +1481,11 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
 		 */
 		if (!PageIsAllVisible(page))
 		{
+			bool 		needwal = RelationNeedsWAL(vacrel->rel);
+
+			if (needwal)
+				CheckWALPermitted();
+
 			START_CRIT_SECTION();
 
 			/* mark buffer dirty before writing a WAL record */
@@ -1494,8 +1500,7 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
 			 * To prevent that, check whether the page has been previously
 			 * WAL-logged, and if not, do that now.
 			 */
-			if (RelationNeedsWAL(vacrel->rel) &&
-				PageGetLSN(page) == InvalidXLogRecPtr)
+			if (needwal && PageGetLSN(page) == InvalidXLogRecPtr)
 				log_newpage_buffer(buf, true);
 
 			PageSetAllVisible(page);
@@ -1811,8 +1816,13 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
+		bool		needwal = RelationNeedsWAL(vacrel->rel);
+
 		Assert(prunestate->hastup);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1837,7 +1847,7 @@ retry:
 		}
 
 		/* Now WAL-log freezing if necessary */
-		if (RelationNeedsWAL(vacrel->rel))
+		if (needwal)
 		{
 			XLogRecPtr	recptr;
 
@@ -2497,6 +2507,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
+	bool		needwal = RelationNeedsWAL(vacrel->rel);
 
 	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
 
@@ -2507,6 +2518,9 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
 							 InvalidOffsetNumber);
 
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	for (; index < dead_items->num_items; index++)
@@ -2537,7 +2551,7 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(vacrel->rel))
+	if (needwal)
 	{
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e09f25a684..1bfe0a5fb3 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
@@ -250,6 +251,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
 	Page		page;
 	uint8	   *map;
+	bool		needwal = RelationNeedsWAL(rel);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -273,12 +275,19 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 
 	if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
 	{
+		/*
+		 * We can reach here from VACUUM or from the startup process, so we
+		 * need not have an XID.
+		 */
+		if (needwal && XLogRecPtrIsInvalid(recptr))
+			CheckWALPermitted();
+
 		START_CRIT_SECTION();
 
 		map[mapByte] |= (flags << mapOffset);
 		MarkBufferDirty(vmBuf);
 
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			if (XLogRecPtrIsInvalid(recptr))
 			{
@@ -475,6 +484,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		Buffer		mapBuffer;
 		Page		page;
 		char	   *map;
+		bool		needwal;
 
 		newnblocks = truncBlock + 1;
 
@@ -488,8 +498,13 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		page = BufferGetPage(mapBuffer);
 		map = PageGetContents(page);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -517,7 +532,7 @@ visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(mapBuffer);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(mapBuffer, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0207421a5d..7237f985bb 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,6 +16,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
@@ -236,6 +237,8 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
 		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(newpage, page);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..1530a6f371 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -18,6 +18,7 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "common/pg_prng.h"
 #include "lib/qunique.h"
@@ -1241,6 +1242,7 @@ _bt_insertonpg(Relation rel,
 		Page		metapg = NULL;
 		BTMetaPageData *metad = NULL;
 		BlockNumber blockcache;
+		bool		needwal = RelationNeedsWAL(rel);
 
 		/*
 		 * If we are doing this insert because we split a page that was the
@@ -1266,6 +1268,9 @@ _bt_insertonpg(Relation rel,
 			}
 		}
 
+		if (needwal)
+			CheckWALPermitted();
+
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
@@ -1304,7 +1309,7 @@ _bt_insertonpg(Relation rel,
 		}
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
@@ -1489,6 +1494,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	bool		newitemonleft,
 				isleaf,
 				isrightmost;
+	bool		needwal;
 
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
@@ -1916,13 +1922,18 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/*
 	 * Right sibling is locked, new siblings are prepared, but original page
 	 * is not updated yet.
 	 *
 	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
 	 * not starting the critical section till here because we haven't been
-	 * scribbling on the original page yet; see comments above.
+	 * scribbling on the original page yet; see the comments above about
+	 * grabbing the right sibling.
 	 */
 	START_CRIT_SECTION();
 
@@ -1959,7 +1970,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_split xlrec;
 		uint8		xlinfo;
@@ -2447,6 +2458,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
@@ -2484,6 +2496,10 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	right_item = CopyIndexTuple(item);
 	BTreeTupleSetDownLink(right_item, rbkno);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
@@ -2541,7 +2557,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_newroot xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 20adb602a4..1f0c476208 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -26,6 +26,7 @@
 #include "access/nbtxlog.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "miscadmin.h"
@@ -236,6 +237,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	bool		needwal;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -272,6 +274,10 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_unlockbuf(rel, metabuf);
 	_bt_lockbuf(rel, metabuf, BT_WRITE);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
@@ -284,7 +290,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_metadata md;
 		XLogRecPtr	recptr;
@@ -403,6 +409,7 @@ _bt_getroot(Relation rel, int access)
 	if (metad->btm_root == P_NONE)
 	{
 		Page		metapg;
+		bool		needwal;
 
 		/* If access = BT_READ, caller doesn't want us to create root yet */
 		if (access == BT_READ)
@@ -448,6 +455,10 @@ _bt_getroot(Relation rel, int access)
 		/* Get raw page pointer for metapage */
 		metapg = BufferGetPage(metabuf);
 
+		needwal = RelationNeedsWAL(rel);
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
 
@@ -466,7 +477,7 @@ _bt_getroot(Relation rel, int access)
 		MarkBufferDirty(metabuf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (needwal)
 		{
 			xl_btree_newroot xlrec;
 			XLogRecPtr	recptr;
@@ -1183,6 +1194,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	if (needswal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -1313,6 +1327,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 										 updatedoffsets, &updatedbuflen,
 										 needswal);
 
+	AssertWALPermittedHaveXID();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2097,6 +2113,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = BTPageGetOpaque(page);
@@ -2185,6 +2202,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	 */
 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2236,7 +2257,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_mark_page_halfdead xlrec;
 		XLogRecPtr	recptr;
@@ -2323,6 +2344,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	bool		needwal;
 
 	page = BufferGetPage(leafbuf);
 	opaque = BTPageGetOpaque(page);
@@ -2554,6 +2576,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Here we begin doing the deletion.
 	 */
 
+	needwal = RelationNeedsWAL(rel);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
@@ -2630,7 +2656,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needwal)
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index e84b5edc03..6d3adcd0e3 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -18,6 +18,7 @@
 #include "access/genam.h"
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "common/pg_prng.h"
 #include "miscadmin.h"
@@ -215,6 +216,8 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 	xlrec.offnumParent = InvalidOffsetNumber;
 	xlrec.nodeI = 0;
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	if (current->offnum == InvalidOffsetNumber ||
@@ -459,6 +462,8 @@ moveLeafs(Relation index, SpGistState *state,
 
 	leafdata = leafptr = palloc(size);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/* copy all the old tuples to new page, unless they're dead */
@@ -1132,6 +1137,8 @@ doPickSplit(Relation index, SpGistState *state,
 
 	leafdata = leafptr = (char *) palloc(totalLeafSizes);
 
+	AssertWALPermittedHaveXID();
+
 	/* Here we begin making the changes to the target pages */
 	START_CRIT_SECTION();
 
@@ -1540,6 +1547,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	if (PageGetExactFreeSpace(current->page) >=
 		newInnerTuple->size - innerTuple->size)
 	{
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * We can replace the inner tuple by new version in-place
 		 */
@@ -1626,6 +1635,8 @@ spgAddNodeAction(Relation index, SpGistState *state,
 		else
 			xlrec.parentBlk = 2;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* insert new ... */
@@ -1811,6 +1822,8 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 									&xlrec.newPage);
 	}
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0049630532..acd6c89270 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -19,6 +19,7 @@
 #include "access/spgist_private.h"
 #include "access/spgxlog.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "catalog/storage_xlog.h"
 #include "commands/vacuum.h"
@@ -139,6 +140,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	int			nDeletable;
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	memset(predecessor, 0, sizeof(predecessor));
 	memset(deletable, 0, sizeof(deletable));
@@ -323,6 +325,10 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 	if (nDeletable != xlrec.nDead + xlrec.nPlaceholder + xlrec.nMove)
 		elog(ERROR, "inconsistent counts of deletable tuples");
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the updates */
 	START_CRIT_SECTION();
 
@@ -371,7 +377,7 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -411,6 +417,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	OffsetNumber toDelete[MaxIndexTuplesPerPage];
 	OffsetNumber i,
 				max = PageGetMaxOffsetNumber(page);
+	bool		needwal;
 
 	xlrec.nDelete = 0;
 
@@ -447,6 +454,10 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 	if (xlrec.nDelete == 0)
 		return;					/* nothing more to do */
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	/* Do the update */
 	START_CRIT_SECTION();
 
@@ -455,7 +466,7 @@ vacuumLeafRoot(spgBulkDeleteState *bds, Relation index, Buffer buffer)
 
 	MarkBufferDirty(buffer);
 
-	if (RelationNeedsWAL(index))
+	if (needwal)
 	{
 		XLogRecPtr	recptr;
 
@@ -502,6 +513,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
 	GlobalVisState *vistest;
+	bool		needwal;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
@@ -509,6 +521,10 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	/* XXX: providing heap relation would allow more pruning */
 	vistest = GlobalVisTestFor(NULL);
 
+	needwal = RelationNeedsWAL(index);
+	if (needwal)
+		CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -584,7 +600,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	if (hasUpdate)
 		MarkBufferDirty(buffer);
 
-	if (hasUpdate && RelationNeedsWAL(index))
+	if (hasUpdate && needwal)
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9f65c600d0..f05082fccc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -1174,6 +1175,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 	ExtendMultiXactMember(nextOffset, nmembers);
 
+	CheckWALPermitted();
+
 	/*
 	 * Critical section from here until caller has written the data into the
 	 * just-reserved SLRU space; we don't want to error out with a partly
@@ -2964,7 +2967,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	Assert(!RecoveryInProgress());
+	Assert(XLogInsertAllowed());
 	Assert(MultiXactState->finishedStartup);
 
 	/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b35da6f1aa..0a07d5cf7a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1184,6 +1185,8 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	XLogEnsureRecordSpace(0, records.num_chunks);
 
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
@@ -2305,6 +2308,9 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
 				  replorigin_session_origin != DoNotReplicateId);
 
+	/* COMMIT PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
@@ -2407,6 +2413,9 @@ RecordTransactionAbortPrepared(TransactionId xid,
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* ROLLBACK PREPARED need not have an XID */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 748120a012..86ccbc3c82 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -17,6 +17,7 @@
 #include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "commands/dbcommands.h"
@@ -75,6 +76,9 @@ GetNewTransactionId(bool isSubXact)
 	if (RecoveryInProgress())
 		elog(ERROR, "cannot assign TransactionIds during recovery");
 
+	/* do not assign transaction id when WAL is prohibited */
+	CheckWALPermitted();
+
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	full_xid = ShmemVariableCache->nextXid;
diff --git a/src/backend/access/transam/walprohibit.c b/src/backend/access/transam/walprohibit.c
index d968c71494..b0e527ae74 100644
--- a/src/backend/access/transam/walprohibit.c
+++ b/src/backend/access/transam/walprohibit.c
@@ -27,6 +27,16 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 
+/*
+ * Assert flag to enforce WAL insert permission check rule before starting a
+ * critical section for the WAL writes.  For this, either of
+ * CheckWALPermitted(), AssertWALPermittedHaveXID(), or AssertWALPermitted()
+ * must be called before starting the critical section.
+ */
+#ifdef USE_ASSERT_CHECKING
+WALProhibitCheckState walprohibit_checked_state = WALPROHIBIT_UNCHECKED;
+#endif
+
 /*
  * Shared-memory WAL prohibit state structure
  */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 62d1018220..0c3a207bf6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -1375,6 +1376,8 @@ RecordTransactionCommit(void)
 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();
 
+		AssertWALPermittedHaveXID();
+
 		/*
 		 * Mark ourselves as within our "commit critical section".  This
 		 * forces any concurrent checkpoint to wait until we've updated
@@ -1741,6 +1744,9 @@ RecordTransactionAbort(bool isSubXact)
 		elog(PANIC, "cannot abort transaction %u, it was already committed",
 			 xid);
 
+	/* We'll be reaching here with valid XID only. */
+	AssertWALPermittedHaveXID();
+
 	/* Fetch the data we need for the abort record */
 	nrels = smgrGetPendingDeletes(false, &rels);
 	nchildren = xactGetCommittedChildren(&children);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5447dcb085..588df3cb6d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -773,7 +773,7 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		elog(ERROR, "cannot make new WAL entries now");
 
 	/*
 	 * Given that we're not in recovery, InsertTimeLineID is set and can't
@@ -2542,9 +2542,11 @@ XLogFlush(XLogRecPtr record)
 	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
 	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
 	 * to act this way too, and because when it tries to write the
-	 * end-of-recovery checkpoint, it should indeed flush.
+	 * end-of-recovery checkpoint, it should indeed flush.  Also, the WAL prohibit
+	 * state should not restrict WAL flushing; otherwise, a dirty buffer could not
+	 * be evicted until WAL has been flushed up to the buffer's LSN.
 	 */
-	if (!XLogInsertAllowed())
+	if (!XLogInsertAllowed() && !IsWALProhibited())
 	{
 		UpdateMinRecoveryPoint(record, false);
 		return;
@@ -6759,6 +6761,9 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/* Error out if WAL writes are disabled. */
+	CheckWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -6922,6 +6927,9 @@ CreateEndOfRecoveryRecord(void)
 	xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
 	WALInsertLockRelease();
 
+	/* Assured that WAL permission has been checked */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	XLogBeginInsert();
@@ -6997,6 +7005,9 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn, XLogRecPtr pagePtr,
 		elog(ERROR, "invalid WAL insert position %X/%X for OVERWRITE_CONTRECORD",
 			 LSN_FORMAT_ARGS(recptr));
 
+	/* Assured that WAL permission has been checked */
+	AssertWALPermitted();
+
 	START_CRIT_SECTION();
 
 	/*
@@ -7665,7 +7676,7 @@ void
 UpdateFullPageWrites(void)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	bool		recoveryInProgress;
+	bool		WALInsertAllowed;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -7679,10 +7690,10 @@ UpdateFullPageWrites(void)
 
 	/*
 	 * Perform this outside critical section so that the WAL insert
-	 * initialization done by RecoveryInProgress() doesn't trigger an
-	 * assertion failure.
+	 * initialization done by XLogInsertAllowed() doesn't trigger an assertion
+	 * failure.
 	 */
-	recoveryInProgress = RecoveryInProgress();
+	WALInsertAllowed = XLogInsertAllowed();
 
 	START_CRIT_SECTION();
 
@@ -7704,8 +7715,11 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XLogStandbyInfoActive() && !recoveryInProgress)
+	if (XLogStandbyInfoActive() && WALInsertAllowed)
 	{
+		/* Assured that WAL permission has been checked */
+		AssertWALPermitted();
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 462e23503e..941f449a3e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -27,6 +27,7 @@
 #include <zstd.h>
 #endif
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
@@ -153,9 +154,20 @@ XLogBeginInsert(void)
 	Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
 	Assert(mainrdata_len == 0);
 
-	/* cross-check on whether we should be here or not */
-	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+	/*
+	 * WAL permission must have been checked before entering the critical
+	 * section.  Otherwise, a WAL-prohibited error would escalate to a PANIC.
+	 */
+	Assert(walprohibit_checked_state != WALPROHIBIT_UNCHECKED ||
+		   CritSectionCount == 0);
+
+	/*
+	 * Cross-check on whether we should be here or not.
+	 *
+	 * This check is primarily for code paths outside a critical section that
+	 * have not already performed the same WAL write permission check.
+	 */
+	CheckWALPermitted();
 
 	if (begininsert_called)
 		elog(ERROR, "XLogBeginInsert was already called");
@@ -233,6 +245,9 @@ XLogResetInsertion(void)
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
 	curinsert_flags = 0;
 	begininsert_called = false;
+
+	/* Reset the walprohibit_checked_state flag */
+	RESET_WALPROHIBIT_CHECKED_STATE();
 }
 
 /*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index ce776c53ca..3b7a3d1fe9 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -28,6 +28,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -477,6 +478,8 @@ CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
 		xl_dbase_create_wal_log_rec xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		xlrec.db_id = dbid;
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index ddf219b21f..411f12fc25 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/table.h"
 #include "access/transam.h"
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -400,6 +401,9 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
 	if (RelationNeedsWAL(rel))
 		GetTopTransactionId();
 
+	/* Cannot have valid XID without WAL permission */
+	AssertWALPermittedHaveXID();
+
 	START_CRIT_SECTION();
 
 	MarkBufferDirty(buf);
@@ -640,6 +644,7 @@ nextval_internal(Oid relid, bool check_permissions)
 				rescnt = 0;
 	bool		cycle;
 	bool		logit = false;
+	bool		needwal;
 
 	/* open and lock sequence */
 	init_sequence(relid, &elm, &seqrel);
@@ -799,9 +804,15 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * to assign xids subxacts, that'll already trigger an appropriate wait.
 	 * (Have to do that here, so we're outside the critical section)
 	 */
-	if (logit && RelationNeedsWAL(seqrel))
+	needwal = logit && RelationNeedsWAL(seqrel);
+	if (needwal)
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -817,7 +828,7 @@ nextval_internal(Oid relid, bool check_permissions)
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (logit && RelationNeedsWAL(seqrel))
+	if (needwal)
 	{
 		xl_seq_rec	xlrec;
 		XLogRecPtr	recptr;
@@ -951,6 +962,7 @@ do_setval(Oid relid, int64 next, bool iscalled)
 	Form_pg_sequence pgsform;
 	int64		maxv,
 				minv;
+	bool		needwal;
 
 	/* open and lock sequence */
 	init_sequence(relid, &elm, &seqrel);
@@ -1001,9 +1013,15 @@ do_setval(Oid relid, int64 next, bool iscalled)
 	elm->cached = elm->last;
 
 	/* check the comment above nextval_internal()'s equivalent call. */
-	if (RelationNeedsWAL(seqrel))
+	needwal = RelationNeedsWAL(seqrel);
+	if (needwal)
+	{
 		GetTopTransactionId();
 
+		/* Cannot have valid XID without WAL permission */
+		AssertWALPermittedHaveXID();
+	}
+
 	/* ready to change the on-disk (or really, in-buffer) tuple */
 	START_CRIT_SECTION();
 
@@ -1014,7 +1032,7 @@ do_setval(Oid relid, int64 next, bool iscalled)
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(seqrel))
+	if (needwal)
 	{
 		xl_seq_rec	xlrec;
 		XLogRecPtr	recptr;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 09dd645bca..4d8c965043 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -955,6 +955,10 @@ RequestCheckpoint(int flags)
 	int			old_failed,
 				old_started;
 
+	/* The checkpoint is allowed in recovery but not in WAL prohibit state */
+	if (!RecoveryInProgress())
+		CheckWALPermitted();
+
 	/*
 	 * If in a standalone backend, just do it ourselves.
 	 */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 93c1ea2d9f..a8afb0dd46 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/tableam.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -3748,6 +3749,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
 
+		AssertWALPermittedHaveXID();
+
 		START_CRIT_SECTION();
 
 		/* Copy page data from the source to the destination. */
@@ -4038,13 +4041,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		{
 			/*
 			 * If we must not write WAL, due to a relfilenode-specific
-			 * condition or being in recovery, don't dirty the page.  We can
-			 * set the hint, just not dirty the page as a result so the hint
-			 * is lost when we evict the page or shutdown.
+			 * condition or in general, don't dirty the page.  We can
+			 * set the hint, but must not dirty the page as a result, lest
+			 * we trigger WAL generation. Unless the page is dirtied again
+			 * later, the hint will be lost when the page is evicted, or at
+			 * shutdown.
 			 *
 			 * See src/backend/storage/page/README for longer discussion.
 			 */
-			if (RecoveryInProgress() ||
+			if (!XLogInsertAllowed() ||
 				RelFileNodeSkippingWAL(bufHdr->tag.rnode))
 				return;
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index d41ae37090..576a69da3f 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -24,6 +24,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/walprohibit.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
@@ -285,12 +286,19 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 	 */
 	if (first_removed_slot > 0)
 	{
+		bool needwal;
+
 		buf = fsm_readbuf(rel, first_removed_address, false);
 		if (!BufferIsValid(buf))
 			return InvalidBlockNumber;	/* nothing to do; the FSM was already
 										 * smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		needwal = (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded());
+
+		if (needwal)
+			CheckWALPermitted();
+
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
@@ -305,7 +313,7 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
 		 * during recovery.
 		 */
 		MarkBufferDirty(buf);
-		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+		if (needwal)
 			log_newpage_buffer(buf, false);
 
 		END_CRIT_SECTION();
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index dee3387d02..91e963bff1 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -43,6 +43,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/walprohibit.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -905,6 +906,8 @@ write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
 		xl_relmap_update xlrec;
 		XLogRecPtr	lsn;
 
+		AssertWALPermittedHaveXID();
+
 		/* now errors are fatal ... */
 		START_CRIT_SECTION();
 
diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
index d71522cbf3..41bc221dbf 100644
--- a/src/include/access/walprohibit.h
+++ b/src/include/access/walprohibit.h
@@ -13,6 +13,7 @@
 
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "nodes/parsenodes.h"
 
@@ -49,6 +50,49 @@ CounterGetWALProhibitState(uint32 wal_prohibit_counter)
 	return (WALProhibitState) (wal_prohibit_counter & 3);
 }
 
+/* Never reaches when WAL is prohibited. */
+static inline void
+AssertWALPermitted(void)
+{
+	Assert(XLogInsertAllowed());
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
+/*
+ * XID-bearing transactions are killed off by executing the pg_prohibit_wal()
+ * function, so any part of the code that can only be reached with an XID
+ * assigned is never reached when WAL is prohibited.
+ */
+static inline void
+AssertWALPermittedHaveXID(void)
+{
+	/* Must be performing an INSERT, UPDATE or DELETE, so we'll have an XID */
+	Assert(FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()));
+	AssertWALPermitted();
+}
+
+/*
+ * In contrast to the above assertion, if a transaction doesn't have a valid
+ * XID (e.g. VACUUM), it won't be killed while changing the system state to
+ * WAL prohibited.  Therefore, we need to explicitly error out before entering
+ * the critical section.
+ */
+static inline void
+CheckWALPermitted(void)
+{
+	if (!XLogInsertAllowed())
+		ereport(ERROR,
+				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+				 errmsg("WAL is now prohibited")));
+
+#ifdef USE_ASSERT_CHECKING
+	walprohibit_checked_state = WALPROHIBIT_CHECKED;
+#endif
+}
+
 extern bool ProcessBarrierWALProhibit(void);
 extern void MarkCheckPointSkippedInWalProhibitState(void);
 extern void WALProhibitStateCounterInit(bool wal_prohibited);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bcf2016421..4510d01db8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -107,6 +107,30 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
 
+#ifdef USE_ASSERT_CHECKING
+typedef enum
+{
+	WALPROHIBIT_UNCHECKED,
+	WALPROHIBIT_CHECKED,
+	WALPROHIBIT_CHECKED_AND_USED
+} WALProhibitCheckState;
+
+/* in access/walprohibit.c */
+extern PGDLLIMPORT WALProhibitCheckState walprohibit_checked_state;
+
+/*
+ * Reset the walprohibit_checked_state flag when no longer in a critical
+ * section.  Otherwise, mark it checked and used.
+ */
+#define RESET_WALPROHIBIT_CHECKED_STATE() \
+do { \
+	walprohibit_checked_state = (CritSectionCount == 0) ? \
+	WALPROHIBIT_UNCHECKED : WALPROHIBIT_CHECKED_AND_USED; \
+} while(0)
+#else
+#define RESET_WALPROHIBIT_CHECKED_STATE() ((void) 0)
+#endif
+
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
@@ -122,6 +146,7 @@ extern void ProcessInterrupts(void);
 do { \
 	if (INTERRUPTS_PENDING_CONDITION()) \
 		ProcessInterrupts(); \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 /* Is ProcessInterrupts() guaranteed to clear InterruptPending? */
@@ -151,6 +176,7 @@ do { \
 do { \
 	Assert(CritSectionCount > 0); \
 	CritSectionCount--; \
+	RESET_WALPROHIBIT_CHECKED_STATE(); \
 } while(0)
 
 
-- 
2.18.0

v45-0001-Create-XLogAcceptWrites-function-with-code-from-.patchapplication/octet-stream; name=v45-0001-Create-XLogAcceptWrites-function-with-code-from-.patchDownload
From 4da892c00e78850eee9cdd4a5c9d0342846179cc Mon Sep 17 00:00:00 2001
From: Amul Sul <amul.sul@enterprisedb.com>
Date: Mon, 4 Oct 2021 00:44:31 -0400
Subject: [PATCH v45 1/6] Create XLogAcceptWrites() function with code from
 StartupXLOG().

This is just code movement. A future patch will want to defer the
call to XLogAcceptWrites() until a later time, rather than doing
it as soon as we finish applying WAL, but here we're just grouping
related code together into a new function.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 100 ++++++++++++++++++------------
 1 file changed, 61 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6770c3ddba..7e7e99a850 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -665,6 +665,10 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
+static bool XLogAcceptWrites(bool performedWalRecovery, TimeLineID newTLI,
+							 TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+							 XLogRecPtr abortedRecPtr,
+							 XLogRecPtr missingContrecPtr);
 static bool PerformRecoveryXLogAction(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
@@ -5527,45 +5531,9 @@ StartupXLOG(void)
 	/* Shut down xlogreader */
 	ShutdownWalRecovery();
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
-	}
-
-	/*
-	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
-	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
-	 */
-	Insert->fullPageWrites = lastFullPageWrites;
-	UpdateFullPageWrites();
-
-	/*
-	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
-	 */
-	if (performedWalRecovery)
-		promoted = PerformRecoveryXLogAction();
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	XLogReportParameters();
-
-	/* If this is archive recovery, perform post-recovery cleanup actions. */
-	if (ArchiveRecoveryRequested)
-		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
-
-	/*
-	 * Local WAL inserts enabled, so it's time to finish initialization of
-	 * commit timestamp.
-	 */
-	CompleteCommitTsInitialization();
+	/* Prepare to accept WAL writes. */
+	promoted = XLogAcceptWrites(performedWalRecovery, newTLI, EndOfLogTLI,
+								EndOfLog, abortedRecPtr, missingContrecPtr);
 
 	/*
 	 * All done with end-of-recovery actions.
@@ -5620,6 +5588,60 @@ StartupXLOG(void)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
+/*
+ * Prepare to accept WAL writes.
+ */
+static bool
+XLogAcceptWrites(bool performedWalRecovery, TimeLineID newTLI,
+				 TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog,
+				 XLogRecPtr abortedRecPtr, XLogRecPtr missingContrecPtr)
+{
+	bool		promoted = false;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
+	}
+
+	/*
+	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
+	 * record before resource manager writes cleanup WAL records or checkpoint
+	 * record is written.
+	 */
+	Insert->fullPageWrites = lastFullPageWrites;
+	UpdateFullPageWrites();
+
+	/*
+	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
+	 */
+	if (performedWalRecovery)
+		promoted = PerformRecoveryXLogAction();
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	XLogReportParameters();
+
+	/* If this is archive recovery, perform post-recovery cleanup actions. */
+	if (ArchiveRecoveryRequested)
+		CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI);
+
+	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization of
+	 * commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	return promoted;
+}
+
 /*
  * Callback from PerformWalRecovery(), called when we switch from crash
  * recovery to archive recovery mode.  Updates the control file accordingly.
-- 
2.18.0

#196Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amul Sul (#100)
Re: [Patch] ALTER SYSTEM READ ONLY

On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:

It is a very minor change, so I rebased the patch. Please take a look, if that works for you.

Thanks, I am getting one more failure, in vacuumlazy.c, on the latest
master head (d75288fb27b); I fixed that in the attached version.

Thanks Amul! I haven't looked at the whole thread, so I may be repeating
things here; please bear with me.

1) Is pg_prohibit_wal() the only way for a user to set the WAL prohibit
mode? Or do we still allow it via 'ALTER SYSTEM READ ONLY/READ WRITE'? If
not, I think the patches still have ALTER SYSTEM READ ONLY references.
2) IIUC, the idea of this patch is to not generate any new WAL once the
mode is set, since default_transaction_read_only and
transaction_read_only can't guarantee that?
3) IMO, the function name pg_prohibit_wal doesn't look good given that it
also allows one to re-enable WAL writes. How about the following
functions: pg_prohibit_wal or pg_disallow_wal_{generation, inserts}, and
pg_allow_wal or pg_allow_wal_{generation, inserts}, without any
arguments, and if needed a common function
pg_set_wal_generation_state(read-only/read-write), or something like that?
4) It looks like only the checkpointer sets the WAL prohibit state? Is
there a strong reason for that? Why can't the backend take a lock on the
prohibit state in shared memory and set it, and let the checkpointer read
it and block itself from writing WAL?
5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to the
checkpointer? Why?
6) What happens to long-running or in-progress transactions if someone
prohibits WAL in the midst of them? Do these txns fail? Or do we say
that we will allow them to run to completion? Or do we fail those txns
at commit time? One might use this feature to, say, keep the server from
running out of disk space, but if we allow in-progress txns to
generate/write WAL, then how can one achieve that with this feature?
Say I monitor my server in such a way that at 90% of disk space I
prohibit WAL to avoid a server crash. But if this feature allows
in-progress txns to generate WAL, then the server may still crash?
7) What are the other use cases for this feature (I can think of
avoiding out-of-disk crashes and blocking/freezing writes to the
database when the server is compromised)? Any usages before, during, or
after failover or promotion?
8) Is there a strong reason that we've picked the condition variable
wal_prohibit_cv over a mutex/lock for updating the WALProhibit shared
memory?
9) Any tests that you are planning to add?

Regards,
Bharath Rupireddy.

#197Amul Sul
sulamul@gmail.com
In reply to: Bharath Rupireddy (#196)
Re: [Patch] ALTER SYSTEM READ ONLY

On Sat, Apr 23, 2022 at 1:34 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:

It is a very minor change, so I rebased the patch. Please take a look, if that works for you.

Thanks, I am getting one more failure, in vacuumlazy.c, on the latest
master head (d75288fb27b); I fixed that in the attached version.

Thanks Amul! I haven't looked at the whole thread, I may be repeating
things here, please bear with me.

Np, thanks for looking into it.

1) Is pg_prohibit_wal() the only way for a user to set the WAL prohibit
mode? Or do we still allow it via 'ALTER SYSTEM READ ONLY/READ WRITE'? If
not, I think the patches still have ALTER SYSTEM READ ONLY references.

Could you please point me to what those references are? I didn't find
any in the v45 version.

2) IIUC, the idea of this patch is to not generate any new WAL once the
mode is set, since default_transaction_read_only and
transaction_read_only can't guarantee that?

No. WAL writes should be disabled completely; in other words, XLogInsert()
should be restricted.
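
For illustration, here is a minimal sketch of the check-before-critical-section
pattern the patch enforces, loosely modeled on the freespace.c hunk above.
log_whole_page() is a hypothetical caller invented for this example, not part
of the patch; the point is only the ordering of the calls:

/*
 * Hypothetical caller, for illustration only.  The permission check (which
 * may ERROR) happens before the critical section; the WAL insertion happens
 * inside it.
 */
static void
log_whole_page(Relation rel, Buffer buf)
{
	bool		needwal = RelationNeedsWAL(rel);

	/* Error out here, outside the critical section, if WAL is prohibited */
	if (needwal)
		CheckWALPermitted();

	START_CRIT_SECTION();

	MarkBufferDirty(buf);
	if (needwal)
		log_newpage_buffer(buf, false);

	END_CRIT_SECTION();
}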

3) IMO, the function name pg_prohibit_wal doesn't look good given that it
also allows one to re-enable WAL writes. How about the following
functions: pg_prohibit_wal or pg_disallow_wal_{generation, inserts}, and
pg_allow_wal or pg_allow_wal_{generation, inserts}, without any
arguments, and if needed a common function
pg_set_wal_generation_state(read-only/read-write), or something like that?

There have already been similar suggestions, but none of them has been
finalized yet; there are other, bigger challenges that need to be
handled first, so we can leave this work for last.

4) It looks like only the checkpointer sets the WAL prohibit state? Is
there a strong reason for that? Why can't the backend take a lock on the
prohibit state in shared memory and set it, and let the checkpointer read
it and block itself from writing WAL?

Once the WAL prohibited state transition is initiated, it must be
completed; there is no fallback. What if the backend exits before the
transition is complete? The checkpointer, by contrast, will be restarted
even if it exits, and will then complete the state transition.

5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to the
checkpointer? Why?

We simply want to wake up the checkpointer process without asking for
specific work in the handler function. Another suitable choice would be
SIGINT; we can choose that instead if needed.

6) What happens to long-running or in-progress transactions if someone
prohibits WAL in the midst of them? Do these txns fail? Or do we say
that we will allow them to run to completion? Or do we fail those txns
at commit time? One might use this feature to, say, keep the server from
running out of disk space, but if we allow in-progress txns to
generate/write WAL, then how can one achieve that with this feature?
Say I monitor my server in such a way that at 90% of disk space I
prohibit WAL to avoid a server crash. But if this feature allows
in-progress txns to generate WAL, then the server may still crash?

Read-only transactions will be allowed to continue; if such a
transaction later tries to write, or if another transaction has already
performed writes, then the session running that transaction will be
terminated -- the design is described in the first mail of this
thread.
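
For reference, a rough sketch of what the barrier handler could look like;
this is illustrative only, not the actual patch code, and
ResetLocalXLogInsertAllowed() is a hypothetical helper standing in for
however the patch invalidates the backend-local cached permission:

/*
 * Sketch only.  A session that already holds an XID can neither commit nor
 * abort without writing WAL, so it is terminated; any other backend simply
 * stops allowing WAL inserts from now on.
 */
bool
ProcessBarrierWALProhibit(void)
{
	if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
		ereport(FATAL,
				(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
				 errmsg("system is now read only"),
				 errhint("Sessions with an open write transaction must be terminated.")));

	/* No XID assigned: just stop allowing WAL inserts locally. */
	ResetLocalXLogInsertAllowed();	/* hypothetical helper */

	return true;
}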

7) What are the other use cases for this feature (I can think of
avoiding out-of-disk crashes and blocking/freezing writes to the
database when the server is compromised)? Any usages before, during, or
after failover or promotion?

The important use case is for failover to avoid split-brain situations.

8) Is there a strong reason that we've picked the condition variable
wal_prohibit_cv over a mutex/lock for updating the WALProhibit shared
memory?

I am not sure how that could be done using a mutex or lock.
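
For illustration, the waiting side might look roughly like the following;
this is not the patch code, and the shared struct name WALProhibit,
WalProhibitRequestedState(), and the wait-event name are assumptions made
up for this sketch. Only wal_prohibit_cv and the condition-variable API are
taken as given:

	/* Requesting backend sleeps until the checkpointer finishes the transition */
	ConditionVariablePrepareToSleep(&WALProhibit->wal_prohibit_cv);
	while (!WalProhibitRequestedState())	/* hypothetical state check */
		ConditionVariableSleep(&WALProhibit->wal_prohibit_cv,
							   WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);	/* assumed */
	ConditionVariableCancelSleep();

A condition variable fits because the backend must wait for a shared state
change that the checkpointer completes at some later point, which is awkward
to express with a plain lock or mutex alone.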

9) Any tests that you are planning to add?

Yes, we can. I have added fairly sophisticated tests that cover most of
my code changes, but that is not enough for such critical code changes;
there is a lot of room for improvement and for adding more tests for
this module as well as other parts, e.g. the missing coverage of gin,
gist, brin, and core features where this patch adds checks.
Any help will be greatly appreciated.

Regards,
Amul

#198Jacob Champion
jchampion@timescale.com
In reply to: Amul Sul (#195)
Re: [Patch] ALTER SYSTEM READ ONLY

On Fri, Apr 8, 2022 at 7:27 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head (891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this
patch dropped out of the commitfest when you withdrew it in January,
but it hasn't been added back with the most recent patchset you
posted. Was that intended, or did you want to re-register it for
review?

--Jacob

#199Amul Sul
sulamul@gmail.com
In reply to: Jacob Champion (#198)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion <jchampion@timescale.com> wrote:

On Fri, Apr 8, 2022 at 7:27 AM Amul Sul <sulamul@gmail.com> wrote:

Attached is the rebased version for the latest master head (891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this
patch dropped out of the commitfest when you withdrew it in January,
but it hasn't been added back with the most recent patchset you
posted. Was that intended, or did you want to re-register it for
review?

Yes, there is a plan to re-register it, but not anytime soon; we will
do that once we start to rework the design.

Regards,
Amul