standby recovery fails (tablespace related) (tentative patch and discussion)

Started by Paul Guo, over 6 years ago (134 messages)

pguo@pivotal.io

over 6 years ago

1 attachment(s)

Hello postgres hackers,

Recently my colleagues and I encountered an issue: a standby cannot
recover after an unclean shutdown, and the failure is related to tablespaces.
The standby re-replays some xlog records that need tablespace
directories (e.g. create a database with a tablespace),
but those tablespace directories have already been removed by the
previous replay.

In detail, the standby finishes replaying the operations below normally,
but because of the unclean shutdown the redo LSN
is not updated in pg_control and still holds a value before the 'create
db with tablespace' xlog record. Since the tablespace
directories were already removed, replaying the database create
WAL record reports an error.

create db with tablespace
drop database
drop tablespace.
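
The sequence above can be modelled with a toy replay loop. This is only a sketch under simplifying assumptions: the record types, the DiskState struct, and the function names are hypothetical and are not PostgreSQL code. It shows why a second replay from a stale redo LSN fails at the first record even though the first replay succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy WAL record types; the names are illustrative, not PostgreSQL's. */
typedef enum { REC_CREATE_DB, REC_DROP_DB, REC_DROP_TABLESPACE } RecType;

/* Toy on-disk state: whether each directory currently exists. */
typedef struct {
    bool tablespace_dir;
    bool database_dir;
} DiskState;

/*
 * Replay one record against the toy state. Returns false when replay would
 * fail, i.e. when the database directory has to be created although its
 * parent tablespace directory is missing.
 */
static bool replay_record(DiskState *disk, RecType rec)
{
    assert(disk != NULL);
    switch (rec) {
    case REC_CREATE_DB:
        if (!disk->tablespace_dir)
            return false;            /* parent missing: creation fails */
        disk->database_dir = true;
        return true;
    case REC_DROP_DB:
        disk->database_dir = false;  /* dropping tolerates a missing dir */
        return true;
    case REC_DROP_TABLESPACE:
        disk->tablespace_dir = false;
        return true;
    }
    return false;
}

/*
 * Replay records[0..n) in order. Returns the index of the first record
 * that fails, or -1 if every record was replayed.
 */
static int replay_all(DiskState *disk, const RecType *records, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!replay_record(disk, records[i]))
            return (int) i;
    return -1;
}
```

Starting from a state where the tablespace directory exists, the first pass replays all three records; a second pass over the same records, as happens when the redo LSN in pg_control was not advanced, stops at the create-database record.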

Here is the log on the standby.
2019-04-17 14:52:14.926 CST [23029] LOG: starting PostgreSQL 12devel on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat
4.8.5-4), 64-bit
2019-04-17 14:52:14.927 CST [23029] LOG: listening on IPv4 address
"192.168.35.130", port 5432
2019-04-17 14:52:14.929 CST [23029] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5432"
2019-04-17 14:52:14.943 CST [23030] LOG: database system was interrupted
while in recovery at log time 2019-04-17 14:48:27 CST
2019-04-17 14:52:14.943 CST [23030] HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2019-04-17 14:52:14.949 CST [23030] LOG: entering standby mode

2019-04-17 14:52:14.950 CST [23030] LOG: redo starts at 0/30105B8

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
2019-04-17 14:52:14.951 CST [23029] LOG: startup process (PID 23030)
exited with exit code 1
2019-04-17 14:52:14.951 CST [23029] LOG: terminating any other active
server processes
2019-04-17 14:52:14.953 CST [23029] LOG: database system is shut down
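
The FATAL line in the log is what a single-level mkdir() reports when the parent of the target directory is missing: errno is ENOENT ("No such file or directory") even though the target is the thing being created. A minimal standalone sketch of that behaviour, assuming a POSIX system; try_mkdir is a hypothetical helper, not a PostgreSQL function:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Try to create exactly one directory level, as a plain mkdir() does.
 * Returns 0 on success, or the errno value on failure, after printing a
 * message shaped like the one in the standby log.
 */
static int try_mkdir(const char *path)
{
    int saved;

    assert(path != NULL);
    if (mkdir(path, 0700) == 0)
        return 0;

    saved = errno;
    fprintf(stderr, "could not create directory \"%s\": %s\n",
            path, strerror(saved));
    return saved;
}
```

Calling try_mkdir() on a path whose parent directory does not exist returns ENOENT, which matches the situation in the log where pg_tblspc/65546 no longer resolves to an existing directory.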

Steps to reproduce:

1. Set up a master and a standby.
2. On both sides, run: mkdir /tmp/some_isolation2_pg_basebackup_tablespace

3. Run SQLs:
drop tablespace if exists some_isolation2_pg_basebackup_tablespace;
create tablespace some_isolation2_pg_basebackup_tablespace location
'/tmp/some_isolation2_pg_basebackup_tablespace';

4. Cleanly shut down and restart both postgres instances.

5. Run the following SQLs:

drop database if exists some_database_with_tablespace;
create database some_database_with_tablespace tablespace
some_isolation2_pg_basebackup_tablespace;
drop database some_database_with_tablespace;
drop tablespace some_isolation2_pg_basebackup_tablespace;
\! pkill -9 postgres; ssh host70 pkill -9 postgres

Note that an immediate shutdown via pg_ctl should also reproduce the issue, and
the above steps probably do not reproduce it 100% of the time.

I created an initial patch for this issue (see the attachment). The idea is
to re-create those directories recursively. The issue above exists
in dbase_redo(),
but TablespaceCreateDbspace() (used when redoing relation file creation) is
probably buggy as well, so I modified that function too. Even if there is no bug
in that function, using a simple pg_mkdir_p() seems cleaner. Note that
when reading TablespaceCreateDbspace() I found that this issue
seems to have been considered already, though insufficiently. Frankly, this
solution (directory re-creation) is not perfect, because this should
have been the responsibility of tablespace creation (which also
does more, such as creating the symlink). Also, I'm not sure whether
we need to use the invalid page mechanism (see xlogutils.c).
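
For readers unfamiliar with the "mkdir -p" idea behind the patch, here is a minimal standalone sketch, assuming a POSIX system. It is not PostgreSQL's pg_mkdir_p(); mkdir_p_sketch, make_one_level, and SKETCH_MAXPATH are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define SKETCH_MAXPATH 1024

/* Create one directory level; an already existing directory is fine. */
static int make_one_level(const char *dir, mode_t mode)
{
    struct stat st;

    if (mkdir(dir, mode) == 0)
        return 0;
    if (errno != EEXIST)
        return -1;
    if (stat(dir, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;        /* something else is in the way */
        return -1;
    }
    return 0;
}

/*
 * Create "path" and any missing ancestors, like "mkdir -p".
 * Returns 0 on success, -1 on failure with errno set.
 */
static int mkdir_p_sketch(const char *path, mode_t mode)
{
    char   buf[SKETCH_MAXPATH];
    size_t len;

    assert(path != NULL);
    len = strlen(path);
    if (len == 0) {
        errno = ENOENT;
        return -1;
    }
    if (len >= sizeof(buf)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memcpy(buf, path, len + 1);

    /* Create every prefix that ends just before a '/' separator. */
    for (char *p = buf + 1; *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';
        if (make_one_level(buf, mode) != 0)
            return -1;
        *p = '/';
    }

    /* Finally create the leaf directory itself. */
    return make_one_level(buf, mode);
}
```

Note that, like the patch, this creates plain directories only; it does not restore the symlink under pg_tblspc/ that a real tablespace creation would set up.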

Another solution: we already create a checkpoint on
createdb/movedb/dropdb/droptablespace, so maybe we should force a
restartpoint on the standby for this special kind of checkpoint WAL record. That
means setting a flag in the checkpoint WAL record and letting the checkpoint redo
code create a restartpoint if that flag is set. This solution seems
safer.

Thanks,
Paul

Attachments:

0001-Recursively-create-tablespace-directories-if-those-a.patch (application/octet-stream)

From 7beccf4b41ebf8acf83ea706e2869e48867a40d3 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Wed, 17 Apr 2019 00:12:31 -0700
Subject: [PATCH] Recursively create tablespace directories if those are gone
 but we need that when re-redoing some tablespace related xlogs (e.g. database
 create).

---
 src/backend/commands/dbcommands.c | 18 ++++++++++++++++++
 src/backend/commands/tablespace.c | 28 +---------------------------
 2 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afa..9999b9b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,6 +45,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2089,6 +2090,7 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
@@ -2107,6 +2109,22 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that the tablespace was later dropped, but we are
+			 * re-redoing database create before that. In that case,
+			 * those directories are gone, and we do not create symlink.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (stat(parent_path, &st) != 0 && pg_mkdir_p(parent_path, pg_dir_create_mode) != 0)
+			{
+				ereport(WARNING,
+						(errmsg("can not recursively create directory \"%s\"",
+								parent_path)));
+			}
+		}
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 3784ea4..f0fac11 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -154,8 +154,6 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -168,32 +166,8 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
-- 
1.8.3.1

Asim R P

apraveen@pivotal.io

over 6 years ago

In reply to: Paul Guo (#1)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

Asim

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Asim R P (#2)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, a checkpoint is performed for drop_db/create_db/drop_tablespace
on the master.
That makes the files/directories up-to-date, if I understand the related
code correctly.
For a standby, checkpoint redo does not ensure that.

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would

The 'create db with tablespace' redo record does not include the tablespace's
real directory information.
Yes, we could add it to the xlog record, but that seems like overdesign.

prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating

directories and files only to remove them shortly afterwards when the

drop database and drop tablespace records are replayed.

The 'invalid page' mechanism seems more suitable for missing pages of a
file. For
missing directories we could, of course, hack it in (e.g. by reading
any page of
a relfile in that database) to make the tablespace create code
(without symlink)
safer (it assumes those directories will be deleted soon).

More feedback on all of the solutions discussed so far is welcome.

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Paul Guo (#3)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.

That's partly right. As you must have seen, a fast shutdown forces a
restartpoint for the last checkpoint, which prevents the problem
from happening. Still, it seems to be a problem.

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would

'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.

But I don't think creating a directory that is to be removed right
afterwards is a desirable solution. The directory will most likely be
removed right afterwards.

prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating

directories and files only to remove them shortly afterwards when the

drop database and drop tablespace records are replayed.

'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g. reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).

More feedback about all of the previous discussed solutions is welcome.

It doesn't seem to me that the invalid page mechanism is
applicable in a straightforward way, because it doesn't consider
simple file copy.

A drop failure is always ignored. I suppose we can likewise ignore the
error and continue recovering as long as recovery has not reached
consistency. The attached would work *at least* for your case, but I
haven't checked whether it covers all the places that need the same
treatment.
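
The severity rule described above can be written down as a small standalone sketch. The field names mirror the InRecovery and reachedConsistency variables used by the attached patch, but the struct, the enum, and the function are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for PostgreSQL's WARNING and ERROR report levels. */
typedef enum { SKETCH_WARNING, SKETCH_ERROR } SketchLevel;

/* Hypothetical snapshot of the two recovery flags the rule depends on. */
typedef struct {
    bool in_recovery;
    bool reached_consistency;
} RecoveryStateSketch;

/*
 * Decide how a failed directory creation should be reported: tolerate it
 * (warning, keep replaying) only while recovery has not yet reached
 * consistency; treat it as a hard error in every other case.
 */
static SketchLevel dir_create_failure_level(const RecoveryStateSketch *state)
{
    assert(state != NULL);
    if (state->in_recovery && !state->reached_consistency)
        return SKETCH_WARNING;
    return SKETCH_ERROR;
}
```

The design choice is that a failure after consistency has been reached still stops recovery, so a genuinely missing directory is not silently ignored forever.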

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

ignore_dir_create_error_before_consistency.patch (text/x-patch; charset=us-ascii)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..0bc63f48da 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+	int ret;
+
+	ret = MakePGDirectory(directoryName);
+
+	if (ret != 0)
+	{
+		int elevel = ERROR;
+
+		/*
+		 * We might get error trying to create existing directory that is to
+		 * be removed just after.  Don't issue ERROR in the case. Recovery
+		 * will stop if we again failed after reaching consistency.
+		 */
+		if (InRecovery && !reachedConsistency)
+			elevel = WARNING;
+
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m", directoryName)));
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
 	char		fromfile[MAXPGPATH * 2];
 	char		tofile[MAXPGPATH * 2];
 
-	if (MakePGDirectory(todir) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m", todir)));
+	/*
+	 * We might have to skip copydir to continue recovery. See the function
+	 * for details.
+	 */
+	if (XLogMakePGDirectory(todir) != 0)
+		return;
 
 	xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					   BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#4)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Oops! The comment in the previous patch is wrong.

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>

The comment for the new function XLogMakePGDirectory is wrong:

+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.

The correct one is:

+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.

It is fixed in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

ignore_dir_create_error_before_consistency_v2.patch (text/x-patch; charset=us-ascii)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..01331f0da9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+	int ret;
+
+	ret = MakePGDirectory(directoryName);
+
+	if (ret != 0)
+	{
+		int elevel = ERROR;
+
+		/*
+		 * We might get error trying to create existing directory that is to
+		 * be removed just after.  Don't issue ERROR in the case. Recovery
+		 * will stop if we again failed after reaching consistency.
+		 */
+		if (InRecovery && !reachedConsistency)
+			elevel = WARNING;
+
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m", directoryName)));
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
 	char		fromfile[MAXPGPATH * 2];
 	char		tofile[MAXPGPATH * 2];
 
-	if (MakePGDirectory(todir) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m", todir)));
+	/*
+	 * We might have to skip copydir to continue recovery. See the function
+	 * for details.
+	 */
+	if (XLogMakePGDirectory(todir) != 0)
+		return;
 
 	xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					   BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#5)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 22 Apr 2019 16:40:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.164027.33866403.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.

The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-TAP-test-to-excecise-tablespace.patch (text/x-patch; charset=us-ascii)

From dbe6306a730f94a5bd8beaf0e534c28ebdd815d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 Apr 2019 20:10:20 +0900
Subject: [PATCH 1/2] Allow TAP test to exercise tablespace.

To perform tablespace related checks, this patch lets
PostgresNode::backup take a new parameter "tablespace_mapping", and
makes init_from_backup capable of handling a backup created using
tablespace_mapping.
---
 src/test/perl/PostgresNode.pm  | 10 ++++++++--
 src/test/perl/RecursiveCopy.pm | 33 +++++++++++++++++++++++++++++----
 2 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 76874141c5..59a939821d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -540,13 +540,19 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mapping})
+	{
+		push(@rest, "--tablespace-mapping=$params{tablespace_mapping}");
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..c912ce412d 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::real_dir($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
-- 
2.16.3

0002-Add-check-for-recovery-failure-caused-by-tablespace.patch (text/x-patch; charset=us-ascii)

From 382910fbe3738c9098c0568cdc992928f471c7c5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 Apr 2019 20:10:25 +0900
Subject: [PATCH 2/2] Add check for recovery failure caused by tablespace.

Removal of a tablespace on master can cause recovery failure on
standby. This patch adds the check for the case.
---
 src/test/recovery/t/011_crash_recovery.pl | 52 ++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 5dc52412ca..d1eb9edccf 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 4;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,53 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+
+# Ensure that tablespace removal doesn't cause error while recovering
+# the preceding create database or objects.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::real_dir($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::real_dir($tspDir_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mapping =>
+					   "$realTSDir_master=$realTSDir_standby");
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This leaves a CREATE DATABASE WAL record
+# that is to be applied to already-removed tablespace.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE db1 WITH TABLESPACE ts1;
+						  DROP DATABASE db1;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
-- 
2.16.3

0003-Fix-failure-of-standby-startup-caused-by-tablespace-.patch (text/x-patch; charset=us-ascii)

From 5e3a9b682e6ec2b6cb4e019409112687bd598ebc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 Apr 2019 20:59:15 +0900
Subject: [PATCH 3/3] Fix failure of standby startup caused by tablespace
 removal

When a standby restarts after a crash that followed the drop of a tablespace,
recovery may fail while trying to create an object in the
already-removed tablespace directory. Allow recovery to continue by
ignoring the error if the consistency point has not been reached.
---
 src/backend/access/transam/xlogutils.c | 34 ++++++++++++++++++++++++++++++++++
 src/backend/storage/file/copydir.c     | 12 +++++++-----
 src/include/access/xlogutils.h         |  1 +
 3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..75cdb882cd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,40 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+	int ret;
+
+	ret = MakePGDirectory(directoryName);
+
+	if (ret != 0)
+	{
+		int elevel = ERROR;
+
+		/* Don't issue ERROR for this failure before reaching consistency. */
+		if (InRecovery && !reachedConsistency)
+			elevel = WARNING;
+
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m", directoryName)));
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
 	char		fromfile[MAXPGPATH * 2];
 	char		tofile[MAXPGPATH * 2];
 
-	if (MakePGDirectory(todir) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m", todir)));
+	/*
+	 * We might have to skip copydir to continue recovery. See the function
+	 * for details.
+	 */
+	if (XLogMakePGDirectory(todir) != 0)
+		return;
 
 	xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					   BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
-- 
2.16.3

Michael Paquier

michael@paquier.xyz

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#6)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:

The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

+    # Check for symlink -- needed only on source dir
+    # (note: this will fall through quietly if file is already gone)
+    if (-l $srcpath)
+    {
+        croak "Cannot operate on symlink \"$srcpath\""
+          if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+        # We have mapped tablespaces. Copy them individually
+        my $linkname = $1;
+        my $tmpdir = TestLib::tempdir;
+        my $dstrealdir = TestLib::real_dir($tmpdir);
+        my $srcrealdir = readlink($srcpath);
+
+        opendir(my $dh, $srcrealdir);
+        while (readdir $dh)
+        {
+            next if (/^\.\.?$/);
+            my $spath = "$srcrealdir/$_";
+            my $dpath = "$dstrealdir/$_";
+
+            copypath($spath, $dpath);
+        }
+        closedir $dh;
+
+        symlink $dstrealdir, $destpath;
+        return 1;
+    }

The same stuff is proposed here:
/messages/by-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@mail.gmail.com

So there is a lot of demand for making the recursive copy better
at handling symlinks for tablespace tests, and I'd like to propose
doing something along those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here? I am not sure
yet which of the proposals is better than the other, though.
--
Michael

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Michael Paquier (#7)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 23 Apr 2019 11:34:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20190423023438.GH2712@paquier.xyz>

On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:

The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.
+    # Check for symlink -- needed only on source dir
+    # (note: this will fall through quietly if file is already gone)
+    if (-l $srcpath)
+    {
+        croak "Cannot operate on symlink \"$srcpath\""
+          if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+        # We have mapped tablespaces. Copy them individually
+        my $linkname = $1;
+        my $tmpdir = TestLib::tempdir;
+        my $dstrealdir = TestLib::real_dir($tmpdir);
+        my $srcrealdir = readlink($srcpath);
+
+        opendir(my $dh, $srcrealdir);
+        while (readdir $dh)
+        {
+            next if (/^\.\.?$/);
+            my $spath = "$srcrealdir/$_";
+            my $dpath = "$dstrealdir/$_";
+
+            copypath($spath, $dpath);
+        }
+        closedir $dh;
+
+        symlink $dstrealdir, $destpath;
+        return 1;
+    }
The same stuff is proposed here:
/messages/by-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@mail.gmail.com

So there is a lot of demand for making the recursive copy better
at handling symlinks for tablespace tests, and I'd like to propose
doing something along those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here? I am not sure
yet which of the proposals is better than the other, though.

TBH I don't like that one (mine, cited above) very much. However, I'd
prefer to have it in v12 if this is a bug that is to be fixed in
v12. Otherwise we won't add a test for this later :p

Anyway I'll visit there. Thanks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#5)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.
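
The behaviour under discussion here, downgrading a directory-creation failure from ERROR to WARNING while recovery has not yet reached consistency, can be sketched outside the server. This is only an illustration under stated assumptions: the `sketch_*` names and the integer return codes are hypothetical, and plain `mkdir()` and `fprintf()` stand in for `MakePGDirectory()` and `ereport()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical stand-ins for the server's report levels. */
typedef enum { SKETCH_WARNING, SKETCH_ERROR } sketch_elevel;

/*
 * Choose the report level for a failed directory creation. Only a failure
 * seen during recovery, before consistency is reached, is downgraded.
 */
static sketch_elevel
sketch_mkdir_elevel(bool in_recovery, bool reached_consistency)
{
    if (in_recovery && !reached_consistency)
        return SKETCH_WARNING;
    return SKETCH_ERROR;
}

/*
 * Try to create "path". Returns 0 on success, -1 when the failure was
 * downgraded to a warning (the caller would skip the copy), and 1 when
 * the failure has to be treated as a hard error.
 */
static int
sketch_make_directory(const char *path, bool in_recovery,
                      bool reached_consistency)
{
    int saved_errno;

    assert(path != NULL);

    if (mkdir(path, 0700) == 0)
        return 0;

    saved_errno = errno;
    if (sketch_mkdir_elevel(in_recovery, reached_consistency) == SKETCH_WARNING)
    {
        fprintf(stderr, "WARNING: could not create directory \"%s\": %s\n",
                path, strerror(saved_errno));
        return -1;
    }

    fprintf(stderr, "ERROR: could not create directory \"%s\": %s\n",
            path, strerror(saved_errno));
    return 1;
}
```

The concern raised above still applies to such a design: once the creation is skipped, nothing under that directory exists, so any later redo that touches those files without going through the buffer manager needs its own handling.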

On Mon, Apr 22, 2019 at 3:40 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Oops! The comment in the previous patch is wrong.

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, a checkpoint is done for drop_db/create_db/drop_tablespace
on master. That makes the files/directories up-to-date, if I understand the
related code correctly.
For standby, checkpoint redo does not ensure that.

That's partly right. As you must have seen, fast shutdown forces a
restartpoint for the last checkpoint and it prevents the problem
from happening. Anyway it seems to be a problem.

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would

'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add it into the xlog, but that seems to be an overdesign.

But I don't think creating a directory that is to be removed just
after is a wanted solution. The directory is most likely to be
removed just after.

prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

'invalid page' mechanism seems to be more proper for missing pages of a
file. For missing directories, we could, of course, hack to use that (e.g.
reading any page of a relfile in that database) to make the tablespace
create code (without symlink) safer (It assumes those directories will be
deleted soon).

More feedback about all of the previously discussed solutions is welcome.

It doesn't seem to me that the invalid page mechanism is
applicable in a straightforward way, because it doesn't consider
simple file copy.

Drop failure is ignored any time. I suppose we can ignore the
error to continue recovering as long as recovery has not reached
consistency. The attached would work *at least* for your case, but I
haven't checked this covers all places where need the same
treatment.

The comment for the new function XLogMakePGDirectory is wrong:
+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.
The correct one is:
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
It is fixed in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Paul Guo (#9)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The symlink to the removed tablespace is already removed
in that case. Hence the server innocently creates files directly under
pg_tblspc, not in the tablespace. Finally, all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling a page in a file that has been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recovery in the same way as above. This case doesn't
cause FATAL/PANIC during recovery even on master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.
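
The create-if-missing policy described in the quoted XLogReadBufferExtended comment can be illustrated with plain POSIX calls. This is a rough sketch, not the server's code path: `sketch_write_block` and `SKETCH_BLCKSZ` are hypothetical names, and the sketch does not model smgr, buffers, or directory creation.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical page size for the sketch. */
#define SKETCH_BLCKSZ 8192

/*
 * Write one page at block number "blkno", creating the target file if it
 * does not exist yet instead of suppressing the write. Returns 0 on
 * success and -1 on failure (for example when the containing directory
 * is missing).
 */
static int
sketch_write_block(const char *path, unsigned int blkno, const char *page)
{
    int     fd;
    ssize_t written;

    assert(path != NULL && page != NULL);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    written = pwrite(fd, page, SKETCH_BLCKSZ, (off_t) blkno * SKETCH_BLCKSZ);

    if (close(fd) != 0)
        return -1;

    return (written == SKETCH_BLCKSZ) ? 0 : -1;
}
```

Note that the open succeeds only when the containing directory exists, which is why accesses that do not go through this kind of path are the remaining concern.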

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#10)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The symlink to the removed tablespace is already removed
in that case. Hence the server innocently creates files directly under
pg_tblspc, not in the tablespace. Finally, all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling a page in a file that has been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recovery in the same way as above. This case doesn't
cause FATAL/PANIC during recovery even on master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resource managers that work on buffers are not affected.

SMGR:
- Both CREATE and TRUNCATE seem fine.

TBLSPC:
- We don't nest tablespace directories. No problem.

I didn't find a similar case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0004-Fix-failure-of-standby-startup-caused-by-tablespace-.patch (text/x-patch; charset=us-ascii)

From bc97e195f21af5d715d85424efc21fcbe8bb902c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 Apr 2019 20:59:15 +0900
Subject: [PATCH 4/5] Fix failure of standby startup caused by tablespace
 removal

When a standby restarts after a crash that follows a tablespace drop,
recovery can fail while trying to create an object in the already
removed tablespace directory. Allow recovery to continue by ignoring
the error if the consistency point has not been reached yet.
---
 src/backend/access/transam/xlogutils.c | 34 ++++++++++++++++++++++++++++++++++
 src/backend/commands/tablespace.c      | 12 ++++++------
 src/backend/storage/file/copydir.c     | 12 +++++++-----
 src/include/access/xlogutils.h         |  1 +
 4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..75cdb882cd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,40 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+	int ret;
+
+	ret = MakePGDirectory(directoryName);
+
+	if (ret != 0)
+	{
+		int elevel = ERROR;
+
+		/* Don't issue ERROR for this failure before reaching consistency. */
+		if (InRecovery && !reachedConsistency)
+			elevel = WARNING;
+
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m", directoryName)));
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 66a70871e6..c9fb2af015 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -303,12 +303,6 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 				(errcode(ERRCODE_INVALID_NAME),
 				 errmsg("tablespace location cannot contain single quotes")));
 
-	/* Reject tablespaces in the data directory. */
-	if (is_in_data_directory(location))
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-				 errmsg("tablespace location must not be inside the data directory")));
-
 	/*
 	 * Check that location isn't too long. Remember that we're going to append
 	 * 'PG_XXX/<dboid>/<relid>_<fork>.<nnn>'.  In the relative path case, we
@@ -323,6 +317,12 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 				 errmsg("tablespace location \"%s\" is too long",
 						location)));
 
+	/* Reject tablespaces in the data directory. */
+	if (is_in_data_directory(location))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				 errmsg("tablespace location must not be inside the data directory")));
+
 	/*
 	 * Disallow creation of tablespaces named "pg_xxx"; we reserve this
 	 * namespace for system purposes.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
 	char		fromfile[MAXPGPATH * 2];
 	char		tofile[MAXPGPATH * 2];
 
-	if (MakePGDirectory(todir) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m", todir)));
+	/*
+	 * We might have to skip copydir to continue recovery. See the function
+	 * for details.
+	 */
+	if (XLogMakePGDirectory(todir) != 0)
+		return;
 
 	xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					   BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
-- 
2.16.3

#12

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#11)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The symlink to the removed tablespace is already removed
in that case. Hence the server innocently creates files directly under
pg_tblspc, not in the tablespace. Finally, all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling a page in a file that has been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recovery in the same way as above. This case doesn't
cause FATAL/PANIC during recovery even on master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resource managers that work on buffers are not affected.

SMGR:
- Both CREATE and TRUNCATE seem fine.

TBLSPC:
- We don't nest tablespace directories. No problem.

I didn't find a similar case.

I took some time to dig into the related code. It seems that ignoring the
failure to create the dst directory should be fine, since the smgr redo code
eventually creates the tablespace path by calling TablespaceCreateDbspace().
What's more, I found some more issues.

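1) The below error message is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
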
That should be due to dbase_desc(). It could be simply fixed following the
code logic in GetDatabasePath().

2) It seems that the src directory could be missing, in which case
dbase_redo()->copydir() could error out. For example,

\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say the standby finishes all replay but the redo lsn in pg_control
still points at 'alter database', and then postgres is killed. In theory,
on startup dbase_redo()->copydir() will ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. The simple code change
below could fix that.

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
         */
        FlushDatabaseBuffers(xlrec->src_db_id);

+       /*
+        * It is possible that the source directory is missing if
+        * we are re-replaying the xlog while subsequent xlogs
+        * drop the tablespace in previous replaying. For this
+        * we just skip.
+        */
+       if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+           return;
+
        /*
         * Copy this subdirectory to the new location
         *

If we want to fix the issue by ignoring the dst path creation failure, I do
not think we should do that in copydir(), since copydir() seems to be a
common function. I'm not sure whether it is used by some extensions or not.
If not, maybe we should move the dst path creation logic out of copydir().

Also I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to replace
the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() code change seems to be more graceful and simpler.
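
For reference, the general idea behind a "mkdir -p" style helper looks roughly like the following. This is a standalone sketch and not PostgreSQL's pg_mkdir_p(); the name `sketch_mkdir_p` and the fixed buffer size are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create "path" and any missing parent directories. Components that
 * already exist are accepted. Returns 0 on success, -1 on failure.
 */
static int
sketch_mkdir_p(const char *path, mode_t mode)
{
    char        buf[4096];
    size_t      len;
    char       *p;
    struct stat st;

    assert(path != NULL);

    len = strlen(path);
    if (len == 0 || len >= sizeof(buf))
    {
        errno = (len == 0) ? ENOENT : ENAMETOOLONG;
        return -1;
    }
    memcpy(buf, path, len + 1);

    /* Create each intermediate component, skipping a leading slash. */
    for (p = buf + 1; *p != '\0'; p++)
    {
        if (*p != '/')
            continue;
        *p = '\0';
        if (mkdir(buf, mode) != 0 && errno != EEXIST)
            return -1;
        *p = '/';
    }

    /* Create the final component. */
    if (mkdir(buf, mode) != 0 && errno != EEXIST)
        return -1;

    /* EEXIST is only acceptable if the final path really is a directory. */
    if (stat(buf, &st) != 0 || !S_ISDIR(st.st_mode))
        return -1;

    return 0;
}
```

A helper of this shape replaces the manual walk over parent directories with a single call, which is the simplification suggested above.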

Whether we ignore the mkdir failure or use mkdir_p, I found that these steps
seem to be error-prone as postgres evolves, since they are hard to test and
it is not easy to think of all the potential bad cases. Would it be possible
to do a real checkpoint (flush & update redo lsn) when seeing checkpoint
xlogs for these operations? This will slow down the standby, but master also
does this and these operations are not usual; in particular, it seems that
it does not usually slow down wal receiving?

#13

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Paul Guo (#12)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

I updated the original patch to

1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
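
The check described in item 1) above, skipping the copy when either the src path or the parent of the dst path is missing, can be sketched as follows. All `sketch_*` names are hypothetical and the parent derivation is deliberately simple (it does not handle trailing slashes); the attached patch may well do this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* True only if "path" exists and is a directory. */
static bool
sketch_is_dir(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Copy the parent directory of "path" into "buf". A path without any
 * slash has "." as its parent, and "/x" has "/". Returns false if the
 * result does not fit into "buf".
 */
static bool
sketch_parent_dir(const char *path, char *buf, size_t buflen)
{
    const char *slash;
    const char *src = path;
    size_t      len;

    assert(path != NULL && buf != NULL);

    slash = strrchr(path, '/');
    if (slash == NULL)
    {
        src = ".";
        len = 1;
    }
    else if (slash == path)
        len = 1;
    else
        len = (size_t) (slash - path);

    if (len >= buflen)
        return false;

    memcpy(buf, src, len);
    buf[len] = '\0';
    return true;
}

/*
 * Decide whether a directory copy should be skipped: either the source
 * directory is gone, or the directory that should contain the
 * destination is gone. An over-long destination path is also treated
 * as "skip" in this sketch.
 */
static bool
sketch_should_skip_copy(const char *src, const char *dst)
{
    char parent[4096];

    if (!sketch_is_dir(src))
        return true;
    if (!sketch_parent_dir(dst, parent, sizeof(parent)))
        return true;
    return !sketch_is_dir(parent);
}
```

With the dst path from the xlog output above, the parent that would be tested is pg_tblspc/16384/PG_12_201904281.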

I was not familiar with the TAP test details before. I learned a lot
about how to test such cases from Kyotaro's patch series.👍

Attachments:

v2-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (application/octet-stream)

From 650221d14c1c34023aa66b1c22398eeae00dbcb8 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Tue, 30 Apr 2019 13:30:49 +0800
Subject: [PATCH v2] skip copydir() if either src directory or dst directory is
 missing due to re-redoing create database but the tablespace is dropped.

---
 src/backend/access/rmgrdesc/dbasedesc.c | 14 ++++++----
 src/backend/commands/dbcommands.c       | 35 ++++++++++++++++++++++++-
 src/backend/commands/tablespace.c       | 28 +-------------------
 3 files changed, 44 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..35092ffb0e 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,21 +23,25 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char	   *dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
-		appendStringInfo(buf, "dir %u/%u",
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "dir %s", dbpath1);
+		pfree(dbpath1);
 	}
 }
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..b7943529be 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,6 +45,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2089,7 +2090,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    do_copydir = true;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2107,6 +2110,35 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that the tablespace was later dropped, but we are
+			 * re-redoing database create before that. In that case,
+			 * the directory are missing, we simply skip the copydir step.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				do_copydir = false;
+				ereport(WARNING,
+						(errmsg("directory \"%s\" for copydir() does not exists."
+								"It is possibly expected. Skip copydir().",
+								parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/* src directory is possibly missing also. See previous comment. */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			do_copydir = false;
+			ereport(WARNING,
+					(errmsg("source directory \"%s\" for copydir() does not exists."
+							"It is possibly expected. Skip copydir().",
+							src_path)));
+		}
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2119,7 +2151,8 @@ dbase_redo(XLogReaderState *record)
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(src_path, dst_path, false);
+		if (do_copydir)
+			copydir(src_path, dst_path, false);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 8ec963f1cf..798c4586b8 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -154,8 +154,6 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -168,32 +166,8 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
-- 
2.17.2

#14

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Paul Guo (#13)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi.

At Tue, 30 Apr 2019 14:33:47 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGhmDKrq7JJu2rLLqcJBR8pA4OYrKsirZ5Ft8-deG1e8A@mail.gmail.com>

I updated the original patch to

It's reasonable not to touch copydir.

1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.

Nice catch.

+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {

This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.

+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));

This message seems unfriendly to users; it reads like an elog message.
How about something like this? The same can be said for the source
directory.

| WARNING: skipped creating database directory: "%s"
| DETAIL: The tablespace %u may have been removed just before crash.

# I'm not confident in this at all:(

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

I wasn't familiar with the TAP test details before. I learned a lot
about how to test such cases from Kyotaro's patch series.👍

Yeah, good to hear.

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:

If we want to fix the issue by ignoring the dst path creation failure, I
do not think we should do that in copydir(), since copydir() seems to be
a common function. I'm not sure whether it is used by some extensions or
not. If not, maybe we should move the dst path creation logic out of
copydir().

Agreed to this.

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that contains a lot of get_parent_directory(),
MakePGDirectory(), etc. Even though it does not fix a bug, the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

Whether we ignore the mkdir failure or use mkdir_p, I found that these
steps seem to be error-prone as postgres evolves, since they are hard to
test and it is not easy to think of all the potential bad cases. Would it
be possible to do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby,
but the master also does this and these operations are not common; in
particular, it seems that it does not usually slow down WAL receiving?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#14)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.

I do not understand this. Can you elaborate?

+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));
This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING: skipped creating database directory: "%s"
| DETAIL: The tablespace %u may have been removed just before crash.

Yeah. Looks better.

# I'm not confident in this at all:(

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that contains a lot of get_parent_directory(),
MakePGDirectory(), etc. Even though it does not fix a bug, the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

Whether we ignore the mkdir failure or use mkdir_p, I found that these
steps seem to be error-prone as postgres evolves, since they are hard to
test and it is not easy to think of all the potential bad cases. Would it
be possible to do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby,
but the master also does this and these operations are not common; in
particular, it seems that it does not usually slow down WAL receiving?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on the master and
the standby through checkpoint/restartpoint. We do not worry about the
master, so we should not worry about the standby either.

#16

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 6 years ago

In reply to: Paul Guo (#15)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>

Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
I do not understand this. Can you elaborate?

Suppose we were recovering from a backup taken at LSN1 with a target of
LSN3, and recovery crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2 is
called the "consistency point"; before it the database is not
consistent. That is because, on the second attempt, we are applying WAL
records older than those that were already applied. The same can be said
for crash recovery, where LSN1 is the latest checkpoint's redo LSN and
LSN2 = LSN3 is the LSN at the crash.

Creating an already existing directory or dropping a nonexistent
directory is apparently inconsistent or "broken", so we should stop
recovery when we see such WAL records while the database is in a
consistent state.
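
The rule described above can be written down as a small predicate. This is only an illustrative sketch under stated assumptions: Lsn is a hypothetical stand-in for the server's WAL position type, and missing_dir_is_tolerable() is a made-up name, not a PostgreSQL function. How the boundary case (a record exactly at the consistency point) should be treated is also an assumption here.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for a WAL position */

/*
 * A missing directory seen while replaying the record at "record_lsn" can
 * be explained by an earlier, interrupted replay only if the record lies
 * before the consistency point. At or after that point the same condition
 * indicates a real problem and recovery should stop.
 */
static bool
missing_dir_is_tolerable(Lsn record_lsn, Lsn consistency_lsn)
{
    return record_lsn < consistency_lsn;
}
```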

+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));
This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING: skipped creating database directory: "%s"
| DETAIL: The tablespace %u may have been removed just before crash.
Yeah. Looks better.

# I'm not confident in this at all:(

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

The original description is right in light of how the server sees it.
The record says exactly "copy dir 1663/1 to 65546/65547", and the latter
path is resolved at the filesystem layer via a symlink.

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that contains a lot of get_parent_directory(),
MakePGDirectory(), etc. Even though it does not fix a bug, the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

We don't want to create the tablespace directory after a concurrent
DROP, as the comment just above says:

| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.

If a concurrent DROP TABLESPACE has destroyed the grandparent
directory, we mustn't create it again.
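
The difference from 'mkdir -p' can be illustrated with a bounded variant that creates the target plus only a fixed number of missing ancestors. This is a standalone sketch, not the code in TablespaceCreateDbspace(); mkdir_bounded() and strip_last_component() are hypothetical names, and tolerating EEXIST at every level is a simplification.

```c
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Strip the last path component in place, e.g. "a/b/c" -> "a/b". */
static void
strip_last_component(char *path)
{
    char *slash = strrchr(path, '/');

    if (slash != NULL && slash != path)
        *slash = '\0';
}

/*
 * Create "dir" plus at most "max_parents" missing ancestors. Unlike
 * "mkdir -p", ancestors further up are never created: if one of them is
 * missing, the call fails with ENOENT. Returns 0 on success, -1 with
 * errno set on failure.
 */
static int
mkdir_bounded(const char *dir, int max_parents, mode_t mode)
{
    char buf[1024];
    int  level;

    if (strlen(dir) >= sizeof(buf))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Walk from the highest allowed ancestor down to dir itself. */
    for (level = max_parents; level >= 0; level--)
    {
        int i;

        strcpy(buf, dir);
        for (i = 0; i < level; i++)
            strip_last_component(buf);

        if (mkdir(buf, mode) < 0 && errno != EEXIST)
            return -1;
    }
    return 0;
}
```

With max_parents = 2 and a path of the form pg_tblspc/<spcoid>/<version dir>/<dboid>, the highest directory that can be created is pg_tblspc/<spcoid>; a missing pg_tblspc itself makes the call fail instead of being silently recreated.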

Whether we ignore the mkdir failure or use mkdir_p, I found that these
steps seem to be error-prone as postgres evolves, since they are hard to
test and it is not easy to think of all the potential bad cases. Would it
be possible to do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby,
but the master also does this and these operations are not common; in
particular, it seems that it does not usually slow down WAL receiving?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on the master and
the standby through checkpoint/restartpoint. We do not worry about the
master, so we should not worry about the standby either.

I didn't mention replication. I said that it slows recovery,
which is not governed by the master's speed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#16)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <
CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>
Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
I do not understand this. Can you elaborate?
Suppose we were recovering from a backup taken at LSN1 with a target of
LSN3, and recovery crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2 is
called the "consistency point"; before it the database is not
consistent. That is because, on the second attempt, we are applying WAL
records older than those that were already applied. The same can be said
for crash recovery, where LSN1 is the latest checkpoint's redo LSN and
LSN2 = LSN3 is the LSN at the crash.

Creating an already existing directory or dropping a nonexistent
directory is apparently inconsistent or "broken", so we should stop
recovery when we see such WAL records while the database is in a
consistent state.

This seems hard to detect. I thought about using the invalid_page
mechanism a while ago, but it seems hard to fully detect a dropped
tablespace.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

The original description is right in light of how the server sees it.
The record says exactly "copy dir 1663/1 to 65546/65547", and the latter
path is resolved at the filesystem layer via a symlink.

Whether you go through $PG_DATA/pg_tblspc or the symlinked real
tablespace directory, there is an additional directory like
PG_12_201905221 between the tablespace oid and the database oid. See the
directory layout below; the directory info in the xlog dump output was
therefore not correct.

$ ls -lh data/pg_tblspc/
total 0
lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2

$ ls -lh /tmp/2
total 0
drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221
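
The mapping from (database oid, tablespace oid) to an on-disk path, including the extra version directory, can be sketched as follows. This is a simplified re-implementation for illustration, not the server's GetDatabasePath(); database_path() is a hypothetical name, and the version directory is passed in as a parameter as a simplification for this sketch.

```c
#include <stdio.h>

#define DEFAULT_TABLESPACE_OID 1663     /* pg_default */
#define GLOBAL_TABLESPACE_OID  1664     /* pg_global */

/*
 * Build the data-directory-relative path of a database directory.
 * Databases in pg_default live under base/; databases in a user-defined
 * tablespace live under pg_tblspc/<spcoid>/<version dir>/<dboid>, where
 * pg_tblspc/<spcoid> is a symlink to the real tablespace location.
 */
static void
database_path(char *buf, size_t buflen, unsigned db_oid, unsigned spc_oid,
              const char *version_dir)
{
    if (spc_oid == GLOBAL_TABLESPACE_OID)
        snprintf(buf, buflen, "global");
    else if (spc_oid == DEFAULT_TABLESPACE_OID)
        snprintf(buf, buflen, "base/%u", db_oid);
    else
        snprintf(buf, buflen, "pg_tblspc/%u/%s/%u",
                 spc_oid, version_dir, db_oid);
}
```

Called with the oids from the xlog dump quoted earlier in this thread, the sketch yields "base/1" for (1, 1663) and "pg_tblspc/16384/PG_12_201904281/16386" for (16386, 16384).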

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that contains a lot of get_parent_directory(),
MakePGDirectory(), etc. Even though it does not fix a bug, the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

We don't want to create the tablespace directory after a concurrent
DROP, as the comment just above says:

| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.

If a concurrent DROP TABLESPACE has destroyed the grandparent
directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.

By the way, based on your previous test patch I added another test which
could easily detect
the missing src directory case.

#18

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Paul Guo (#17)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, May 27, 2019 at 9:39 PM Paul Guo <pguo@pivotal.io> wrote:

[...]

I updated the patch to v3. In this version, we skip the error if copydir
fails due to a missing src/dst directory, but to make sure that ignoring
it is legitimate, I added a simple log/forget mechanism (using a List),
similar to the xlog invalid-page checking mechanism. Two TAP tests are
included: one is from a previous patch by Kyotaro in this email thread
and the other was added by me. In addition, dbase_desc() is fixed to make
the message accurate.
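
The log/forget bookkeeping described above can be sketched outside the server with a plain linked list. This is an illustration only: the patch itself uses PostgreSQL's List and elog(), and the names log_missing_dir(), forget_missing_dir() and count_unresolved_missing_dirs() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

typedef struct MissingDir
{
    char              *path;
    struct MissingDir *next;
} MissingDir;

/* Directories found missing during replay and not yet explained. */
static MissingDir *missing_dirs = NULL;

/* Remember a directory that was missing when a copy was skipped. */
static void
log_missing_dir(const char *path)
{
    MissingDir *entry = malloc(sizeof(MissingDir));
    size_t      len = strlen(path) + 1;

    if (entry == NULL)
        abort();
    entry->path = malloc(len);
    if (entry->path == NULL)
        abort();
    memcpy(entry->path, path, len);
    entry->next = missing_dirs;
    missing_dirs = entry;
}

/* Drop the first matching entry once a later record explains it. */
static void
forget_missing_dir(const char *path)
{
    MissingDir **link = &missing_dirs;

    while (*link != NULL)
    {
        MissingDir *entry = *link;

        if (strcmp(entry->path, path) == 0)
        {
            *link = entry->next;
            free(entry->path);
            free(entry);
            return;
        }
        link = &entry->next;
    }
}

/* Anything left when consistency is reached was never explained. */
static int
count_unresolved_missing_dirs(void)
{
    int n = 0;

    for (MissingDir *e = missing_dirs; e != NULL; e = e->next)
        n++;
    return n;
}
```

In the v3 patch the corresponding check runs from CheckRecoveryConsistency() and raises PANIC if the list is not empty at that point.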

Thanks.

Attachments:

v3-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (application/octet-stream)

From 968adb96f7440386e5ee70f66fbd7495068a09e6 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Tue, 30 Apr 2019 13:30:49 +0800
Subject: [PATCH v3] skip copydir() if either src directory or dst directory is
 missing due to re-redoing create database but the tablespace is dropped.

Also correct dbase_desc() so that related xlog description is not misleading.

Add tap tests for the previous tablespace patch.

One of the tests and the related change in PostgresNode.pm are actually
from the community (Kyotaro HORIGUCHI). I added another test and modified
PostgresNode.pm further to support my added test.

This patch uses the log/forget mechanism to avoid bad ignoring.
---
 src/backend/access/rmgrdesc/dbasedesc.c   |  14 ++-
 src/backend/access/transam/xlog.c         |   4 +
 src/backend/commands/dbcommands.c         | 107 +++++++++++++++++++++-
 src/include/commands/dbcommands.h         |   2 +
 src/test/perl/PostgresNode.pm             |  13 ++-
 src/test/perl/RecursiveCopy.pm            |  33 ++++++-
 src/test/recovery/t/011_crash_recovery.pl |  99 +++++++++++++++++++-
 7 files changed, 259 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..35092ffb0e 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,21 +23,25 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char	   *dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
-		appendStringInfo(buf, "dir %u/%u",
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "dir %s", dbpath1);
+		pfree(dbpath1);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e829..5137a75efd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/tablespace.h"
+#include "commands/dbcommands.h"
 #include "common/controldata_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -7842,6 +7843,9 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/* Check whether some missing directories are unexpected. */
+		CheckMissingDirs4DbaseRedo();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 15207bf75a..bc7ab3084e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,6 +45,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -92,6 +93,68 @@ static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 
+static void log_missing_directory(char *dir);
+static void forget_missing_directory(char *dir);
+
+/*
+ * During XLOG replay, we may see either src directory or dst directory
+ * is missing during copying directory when creating database in dbase_redo()
+ * if the related tablespace was later dropped but we do re-redoing in
+ * recovery after abnormal shutdown. We do simply ignore copying in
+ * dbase_redo() but log those directories in memory and then sanity check
+ * the potential bug or bad user behaviors.
+ *
+ * We use List for simplicity since this should be ok for most cases - the
+ * list should be not long in usual case.
+ */
+static List *missing_dirs_dbase_redo = NIL;
+
+void
+CheckMissingDirs4DbaseRedo()
+{
+	ListCell *lc;
+
+	if (missing_dirs_dbase_redo == NIL)
+		return;
+
+	foreach(lc, missing_dirs_dbase_redo)
+	{
+		char *dir_entry = (char *) lfirst(lc);
+
+		elog(LOG, "Directory \"%s\" was missing during directory copying "
+			 "when replaying 'database create'", dir_entry);
+	}
+
+	elog(PANIC, "WAL replay was wrong due to previous missing directories");
+}
+
+static void
+log_missing_directory(char *dir)
+{
+	elog(DEBUG2, "Logging missing directory for dbase_redo(): \"%s\"", dir);
+	missing_dirs_dbase_redo = lappend(missing_dirs_dbase_redo, pstrdup(dir));
+}
+
+static void
+forget_missing_directory(char *dir)
+{
+	ListCell *prev, *lc;
+
+	prev = NULL;
+	foreach(lc, missing_dirs_dbase_redo)
+	{
+		char *dir_entry = (char *) lfirst(lc);
+
+		if (strcmp(dir_entry, dir) == 0)
+		{
+			missing_dirs_dbase_redo = list_delete_cell(missing_dirs_dbase_redo, lc, prev);
+			elog(DEBUG2, "forgetting missing directory for dbase_redo(): \"%s\"", dir);
+			return;
+		}
+
+		prev = lc;
+	}
+}
 
 /*
  * CREATE DATABASE
@@ -2089,7 +2152,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    do_copydir = true;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2107,6 +2172,43 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that the tablespace was previously dropped, but
+			 * we are re-redoing database create with that tablespace after
+			 * an abnormal shutdown (e.g. immediate shutdown). In that case,
+			 * the directory are missing, we simply skip the copydir step.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				do_copydir = false;
+				log_missing_directory(dst_path);
+				ereport(WARNING,
+						(errmsg("Skip creating database directory \"%s\". "
+								"The dest tablespace may have been removed "
+								"before abnormal shutdown. If the removal "
+								"is illegal after later checking we will panic.",
+								parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/* src directory is possibly missing during redo also. */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			do_copydir = false;
+			log_missing_directory(src_path);
+			ereport(WARNING,
+					(errmsg("Skip creating database directory based on "
+							"\"%s\". The src tablespace may have been "
+							"removed before abnormal shutdown. If the removal "
+							"is illegal after later checking we will panic.",
+							src_path)));
+
+		}
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2119,7 +2221,8 @@ dbase_redo(XLogReaderState *record)
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(src_path, dst_path, false);
+		if (do_copydir)
+			copydir(src_path, dst_path, false);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -2162,6 +2265,8 @@ dbase_redo(XLogReaderState *record)
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
 
+		forget_missing_directory(dst_path);
+
 		if (InHotStandby)
 		{
 			/*
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 28bf21153d..4893e0e289 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,6 +19,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"
 
+extern void CheckMissingDirs4DbaseRedo(void);
+
 extern Oid	createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok);
 extern ObjectAddress RenameDatabase(const char *oldname, const char *newname);
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8d5ad6bc16..e5b465dafd 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -551,13 +551,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..c912ce412d 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::real_dir($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 5dc52412ca..fad6a64d2a 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,100 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# Ensure that tablespace removal doesn't cause error while recovering
+# the preceding create database or objects.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::real_dir($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::real_dir($tspDir_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master=$realTSDir_standby");
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This leaves a CREATE DATABASE WAL record
+# that is to be applied to already-removed tablespace.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE db1 WITH TABLESPACE ts1;
+						  DROP DATABASE db1;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# Ensure that tablespace removal doesn't cause error while recovering
+# the preceding create database or objects.
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+$tspDir_master = TestLib::tempdir;
+$realTSDir_master = TestLib::real_dir($tspDir_master);
+mkdir "$realTSDir_master/1";
+mkdir "$realTSDir_master/2";
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master/1'");
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts2 LOCATION '$realTSDir_master/2'");
+
+$tspDir_standby = TestLib::tempdir;
+$realTSDir_standby = TestLib::real_dir($tspDir_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master/1=$realTSDir_standby/1,$realTSDir_master/2=$realTSDir_standby/2");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown ...
+$node_master->safe_psql('postgres',
+						q[ALTER DATABASE db1 SET TABLESPACE ts2;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
-- 
2.17.2

#19

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Paul Guo (#18)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <pguo@pivotal.io> wrote:

I updated the patch to v3. In this version, we skip the error if copydir fails due to missing src/dst directory,
but to make sure the ignoring is legal, I add a simple log/forget mechanism (Using List) similar to the xlog invalid page
checking mechanism. Two tap tests are included. One is actually from a previous patch by Kyotaro in this
email thread and another is added by me. In addition, dbase_desc() is fixed to make the message accurate.

Hello Paul,

FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

--
Thomas Munro
https://enterprisedb.com

#20

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Thomas Munro (#19)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Jul 8, 2019 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <pguo@pivotal.io> wrote:

I updated the patch to v3. In this version, we skip the error if copydir fails due to missing src/dst directory,
but to make sure the ignoring is legal, I add a simple log/forget mechanism (Using List) similar to the xlog invalid page
checking mechanism. Two tap tests are included. One is actually from a previous patch by Kyotaro in this
email thread and another is added by me. In addition, dbase_desc() is fixed to make the message accurate.

Hello Paul,

FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

The failure occurs because the v3 patch had not been rebased over a recent
commit:

commit 660a2b19038b2f6b9f6bcb2c3297a47d5e3557a8
Author: Noah Misch <noah@leadboat.com>
Date: Fri Jun 21 20:34:23 2019 -0700

    Consolidate methods for translating a Perl path to a Windows path.

My patch used TestLib::real_dir, which that commit replaced with
TestLib::perl2host.

I've updated the patch to v4 to match. The test now passes in my local
environment.

Please see the attached v4 patch.

Thanks.
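
The log/forget bookkeeping described in the quoted text can be sketched outside the PostgreSQL tree roughly as follows. This is only an illustration under stated assumptions: it uses a plain malloc-based linked list instead of PostgreSQL's List, and the names log_missing_dir, forget_missing_dir and count_unexplained_dirs are hypothetical stand-ins, not functions from the patch. A directory found missing during replay is recorded, a later record that accounts for it removes the entry, and anything still recorded when consistency is reached is treated as an error (the patch PANICs at that point).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the patch's missing_dirs_dbase_redo list. */
typedef struct MissingDir
{
    char *path;
    struct MissingDir *next;
} MissingDir;

static MissingDir *missing_dirs = NULL;

/* Copy a string with malloc (stand-in for pstrdup in the patch). */
static char *
copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy == NULL)
        abort();
    memcpy(copy, s, len);
    return copy;
}

/* Record a directory that was missing when a create-database record was replayed. */
static void
log_missing_dir(const char *path)
{
    MissingDir *entry = malloc(sizeof(MissingDir));

    if (entry == NULL)
        abort();
    entry->path = copy_string(path);
    entry->next = missing_dirs;
    missing_dirs = entry;
}

/* Drop the entry once a later record (e.g. a replayed drop) explains the absence. */
static void
forget_missing_dir(const char *path)
{
    MissingDir **link = &missing_dirs;

    while (*link != NULL)
    {
        MissingDir *entry = *link;

        if (strcmp(entry->path, path) == 0)
        {
            *link = entry->next;    /* unlink, then release the entry */
            free(entry->path);
            free(entry);
            return;
        }
        link = &entry->next;
    }
}

/* Entries still present at the consistency check are the unexplained ones. */
static int
count_unexplained_dirs(void)
{
    int n = 0;

    for (MissingDir *e = missing_dirs; e != NULL; e = e->next)
        n++;
    return n;
}
```

In this sketch a nonzero count at the consistency point corresponds to the case where the patch reports each remaining directory and then stops recovery.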

Attachments:

v4-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (application/octet-stream)

From 7339a82089d2f363a55cb651d3651a2c01f2af7d Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Tue, 30 Apr 2019 13:30:49 +0800
Subject: [PATCH v4] skip copydir() if either src directory or dst directory is
 missing due to re-redoing create database but the tablespace is dropped.

Also correct dbase_desc() so that related xlog description is not misleading.

This patch uses the log/forget mechanism to avoid bad ignoring.

Kyotaro horiguchi added one test and did related change in PostgresNode.pm and
had a lot of discussion on this issue. I further added another test and
modified PostgresNode.pm again to support my new test.
---
 src/backend/access/rmgrdesc/dbasedesc.c   |  14 ++-
 src/backend/access/transam/xlog.c         |   4 +
 src/backend/commands/dbcommands.c         | 107 +++++++++++++++++++++-
 src/include/commands/dbcommands.h         |   2 +
 src/test/perl/PostgresNode.pm             |  13 ++-
 src/test/perl/RecursiveCopy.pm            |  33 ++++++-
 src/test/recovery/t/011_crash_recovery.pl |  99 +++++++++++++++++++-
 7 files changed, 259 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..35092ffb0e 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,21 +23,25 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char	   *dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
-		appendStringInfo(buf, "dir %u/%u",
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "dir %s", dbpath1);
+		pfree(dbpath1);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b6c9353cbd..aa3e5c726c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/tablespace.h"
+#include "commands/dbcommands.h"
 #include "common/controldata_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -7855,6 +7856,9 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/* Check whether some missing directories are unexpected. */
+		CheckMissingDirs4DbaseRedo();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 863f89f19d..8ae804467e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,6 +45,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -92,6 +93,68 @@ static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 
+static void log_missing_directory(char *dir);
+static void forget_missing_directory(char *dir);
+
+/*
+ * During XLOG replay, we may see either src directory or dst directory
+ * is missing during copying directory when creating database in dbase_redo()
+ * if the related tablespace was later dropped but we do re-redoing in
+ * recovery after abnormal shutdown. We do simply ignore copying in
+ * dbase_redo() but log those directories in memory and then sanity check
+ * the potential bug or bad user behaviors.
+ *
+ * We use List for simplicity since this should be ok for most cases - the
+ * list should be not long in usual case.
+ */
+static List *missing_dirs_dbase_redo = NIL;
+
+void
+CheckMissingDirs4DbaseRedo()
+{
+	ListCell *lc;
+
+	if (missing_dirs_dbase_redo == NIL)
+		return;
+
+	foreach(lc, missing_dirs_dbase_redo)
+	{
+		char *dir_entry = (char *) lfirst(lc);
+
+		elog(LOG, "Directory \"%s\" was missing during directory copying "
+			 "when replaying 'database create'", dir_entry);
+	}
+
+	elog(PANIC, "WAL replay was wrong due to previous missing directories");
+}
+
+static void
+log_missing_directory(char *dir)
+{
+	elog(DEBUG2, "Logging missing directory for dbase_redo(): \"%s\"", dir);
+	missing_dirs_dbase_redo = lappend(missing_dirs_dbase_redo, pstrdup(dir));
+}
+
+static void
+forget_missing_directory(char *dir)
+{
+	ListCell *prev, *lc;
+
+	prev = NULL;
+	foreach(lc, missing_dirs_dbase_redo)
+	{
+		char *dir_entry = (char *) lfirst(lc);
+
+		if (strcmp(dir_entry, dir) == 0)
+		{
+			missing_dirs_dbase_redo = list_delete_cell(missing_dirs_dbase_redo, lc, prev);
+			elog(DEBUG2, "forgetting missing directory for dbase_redo(): \"%s\"", dir);
+			return;
+		}
+
+		prev = lc;
+	}
+}
 
 /*
  * CREATE DATABASE
@@ -2108,7 +2171,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    do_copydir = true;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2126,6 +2191,43 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that the tablespace was previously dropped, but
+			 * we are re-redoing database create with that tablespace after
+			 * an abnormal shutdown (e.g. immediate shutdown). In that case,
+			 * the directory are missing, we simply skip the copydir step.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				do_copydir = false;
+				log_missing_directory(dst_path);
+				ereport(WARNING,
+						(errmsg("Skip creating database directory \"%s\". "
+								"The dest tablespace may have been removed "
+								"before abnormal shutdown. If the removal "
+								"is illegal after later checking we will panic.",
+								parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/* src directory is possibly missing during redo also. */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			do_copydir = false;
+			log_missing_directory(src_path);
+			ereport(WARNING,
+					(errmsg("Skip creating database directory based on "
+							"\"%s\". The src tablespace may have been "
+							"removed before abnormal shutdown. If the removal "
+							"is illegal after later checking we will panic.",
+							src_path)));
+
+		}
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2138,7 +2240,8 @@ dbase_redo(XLogReaderState *record)
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(src_path, dst_path, false);
+		if (do_copydir)
+			copydir(src_path, dst_path, false);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -2181,6 +2284,8 @@ dbase_redo(XLogReaderState *record)
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
 
+		forget_missing_directory(dst_path);
+
 		if (InHotStandby)
 		{
 			/*
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 28bf21153d..4893e0e289 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,6 +19,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"
 
+extern void CheckMissingDirs4DbaseRedo(void);
+
 extern Oid	createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok);
 extern ObjectAddress RenameDatabase(const char *oldname, const char *newname);
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37f91..ba9b3f180f 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -542,13 +542,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..514ed90ae7 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::perl2host($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 5dc52412ca..a769236438 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,100 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# Ensure that tablespace removal doesn't cause error while recovering
+# the preceding create database with that tablespace.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::perl2host($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master=$realTSDir_standby");
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This leaves a CREATE DATABASE WAL record
+# that is to be applied to already-removed tablespace.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE db1 WITH TABLESPACE ts1;
+						  DROP DATABASE db1;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# Ensure that tablespace removal doesn't cause error while recovering the
+# preceding alter database set tablespace.
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+$tspDir_master = TestLib::tempdir;
+$realTSDir_master = TestLib::perl2host($tspDir_master);
+mkdir "$realTSDir_master/1";
+mkdir "$realTSDir_master/2";
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master/1'");
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts2 LOCATION '$realTSDir_master/2'");
+
+$tspDir_standby = TestLib::tempdir;
+$realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master/1=$realTSDir_standby/1,$realTSDir_master/2=$realTSDir_standby/2");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown ...
+$node_master->safe_psql('postgres',
+						q[ALTER DATABASE db1 SET TABLESPACE ts2;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
-- 
2.17.2

#21

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Paul Guo (#20)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Jul 15, 2019 at 10:52 PM Paul Guo <pguo@pivotal.io> wrote:

Please see the attached v4 patch.

While moving this to the next CF, I noticed that this needs updating
for the new pg_list.h API.

--
Thomas Munro
https://enterprisedb.com

#22

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Thomas Munro (#21)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.
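
The pg_list.h update shows up in v5 as list_delete_cell() being called without the prev argument. One way to see why a predecessor is no longer needed once a list is backed by an array is that deletion can work from a position alone and shift the tail down. The sketch below is a standalone illustration with hypothetical names (DirList, dirlist_append, dirlist_delete_nth, dirlist_forget) and a fixed capacity; it is not the PostgreSQL implementation.

```c
#include <assert.h>
#include <string.h>

#define MAX_ITEMS 8

/* Hypothetical fixed-capacity, array-backed list of directory names. */
typedef struct
{
    int length;
    const char *items[MAX_ITEMS];
} DirList;

/* Append an item; returns 1 on success, 0 if the list is full. */
static int
dirlist_append(DirList *list, const char *item)
{
    if (list->length >= MAX_ITEMS)
        return 0;
    list->items[list->length++] = item;
    return 1;
}

/* Delete by position: no predecessor pointer, later items shift down by one. */
static void
dirlist_delete_nth(DirList *list, int n)
{
    if (n < 0 || n >= list->length)
        return;
    memmove(&list->items[n], &list->items[n + 1],
            (size_t) (list->length - n - 1) * sizeof(list->items[0]));
    list->length--;
}

/* Remove the first matching entry; returns 1 if something was removed. */
static int
dirlist_forget(DirList *list, const char *item)
{
    for (int i = 0; i < list->length; i++)
    {
        if (strcmp(list->items[i], item) == 0)
        {
            dirlist_delete_nth(list, i);
            return 1;           /* stop here: positions after i have shifted */
        }
    }
    return 0;
}
```

Returning immediately after the deletion mirrors what forget_missing_directory() does in the patch, so the loop never continues over positions that have moved.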

On Fri, Aug 2, 2019 at 6:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Jul 15, 2019 at 10:52 PM Paul Guo <pguo@pivotal.io> wrote:

Please see the attached v4 patch.

While moving this to the next CF, I noticed that this needs updating
for the new pg_list.h API.

--
Thomas Munro

https://enterprisedb.com

Attachments:

v5-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (application/octet-stream)

From 1b6c0c2c67cfcdedb3de93d9a048cf86e4ae04f6 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@pivotal.io>
Date: Tue, 30 Apr 2019 13:30:49 +0800
Subject: [PATCH v5] skip copydir() if either src directory or dst directory is
 missing due to re-redoing create database but the tablespace is dropped.

Also correct dbase_desc() so that related xlog description is not misleading.

This patch uses the log/forget mechanism to avoid bad ignoring.

Kyotaro horiguchi added one test and did related change in PostgresNode.pm and
had a lot of discussion on this issue. I further added another test and
modified PostgresNode.pm again to support my new test.
---
 src/backend/access/rmgrdesc/dbasedesc.c   |  14 +--
 src/backend/access/transam/xlog.c         |   4 +
 src/backend/commands/dbcommands.c         | 104 +++++++++++++++++++++-
 src/include/commands/dbcommands.h         |   2 +
 src/test/perl/PostgresNode.pm             |  13 ++-
 src/test/perl/RecursiveCopy.pm            |  33 ++++++-
 src/test/recovery/t/011_crash_recovery.pl |  99 +++++++++++++++++++-
 7 files changed, 256 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..35092ffb0e 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,21 +23,25 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char	   *dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
-		appendStringInfo(buf, "dir %u/%u",
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "dir %s", dbpath1);
+		pfree(dbpath1);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e651a841bb..0c4928a7c7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/tablespace.h"
+#include "commands/dbcommands.h"
 #include "common/controldata_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -7858,6 +7859,9 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/* Check whether some missing directories are unexpected. */
+		CheckMissingDirs4DbaseRedo();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 95881a8550..bf85677bd4 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,6 +45,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -92,6 +93,65 @@ static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 
+static void log_missing_directory(char *dir);
+static void forget_missing_directory(char *dir);
+
+/*
+ * During XLOG replay, dbase_redo() may find the source or the destination
+ * directory missing when copying a directory to create a database.  This
+ * happens when the related tablespace was dropped later and recovery
+ * replays the record again after an abnormal shutdown.  We skip the copy
+ * in dbase_redo(), remember those directories in memory, and sanity check
+ * them later to catch potential bugs or bad user behavior.
+ *
+ * We use a List for simplicity, which should be fine in most cases: the
+ * list is not expected to grow long.
+ */
+static List *missing_dirs_dbase_redo = NIL;
+
+void
+CheckMissingDirs4DbaseRedo(void)
+{
+	ListCell *lc;
+
+	if (missing_dirs_dbase_redo == NIL)
+		return;
+
+	foreach(lc, missing_dirs_dbase_redo)
+	{
+		char *dir_entry = (char *) lfirst(lc);
+
+		elog(LOG, "Directory \"%s\" was missing during directory copying "
+			 "when replaying 'database create'", dir_entry);
+	}
+
+	elog(PANIC, "WAL replay was wrong due to previous missing directories");
+}
+
+static void
+log_missing_directory(char *dir)
+{
+	elog(DEBUG2, "Logging missing directory for dbase_redo(): \"%s\"", dir);
+	missing_dirs_dbase_redo = lappend(missing_dirs_dbase_redo, pstrdup(dir));
+}
+
+static void
+forget_missing_directory(char *dir)
+{
+	ListCell *lc;
+
+	foreach(lc, missing_dirs_dbase_redo)
+	{
+		char *dir_entry = (char *) lfirst(lc);
+
+		if (strcmp(dir_entry, dir) == 0)
+		{
+			missing_dirs_dbase_redo = list_delete_cell(missing_dirs_dbase_redo, lc);
+			elog(DEBUG2, "forgetting missing directory for dbase_redo(): \"%s\"", dir);
+			return;
+		}
+	}
+}
 
 /*
  * CREATE DATABASE
@@ -2129,7 +2189,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    do_copydir = true;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2147,6 +2209,43 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that the tablespace was previously dropped, but
+			 * we are re-redoing database create with that tablespace after
+			 * an abnormal shutdown (e.g. immediate shutdown). In that case,
+			 * the directory is missing, so we simply skip the copydir step.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				do_copydir = false;
+				log_missing_directory(dst_path);
+				ereport(WARNING,
+						(errmsg("Skip creating database directory \"%s\". "
+								"The dest tablespace may have been removed "
+								"before abnormal shutdown. If the removal "
+								"is illegal after later checking we will panic.",
+								parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/* src directory is possibly missing during redo also. */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			do_copydir = false;
+			log_missing_directory(src_path);
+			ereport(WARNING,
+					(errmsg("Skip creating database directory based on "
+							"\"%s\". The src tablespace may have been "
+							"removed before abnormal shutdown. If the removal "
+							"is illegal after later checking we will panic.",
+							src_path)));
+
+		}
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2159,7 +2258,8 @@ dbase_redo(XLogReaderState *record)
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(src_path, dst_path, false);
+		if (do_copydir)
+			copydir(src_path, dst_path, false);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -2202,6 +2302,8 @@ dbase_redo(XLogReaderState *record)
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
 
+		forget_missing_directory(dst_path);
+
 		if (InHotStandby)
 		{
 			/*
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 154c8157ee..26e96b8957 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,6 +19,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"
 
+extern void CheckMissingDirs4DbaseRedo(void);
+
 extern Oid	createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok);
 extern ObjectAddress RenameDatabase(const char *oldname, const char *newname);
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..b5dc3a8918 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..514ed90ae7 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::perl2host($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..5ac7303dd6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,100 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# Ensure that tablespace removal doesn't cause error while recovering
+# the preceding create database with that tablespace.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::perl2host($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master=$realTSDir_standby");
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This leaves a CREATE DATABASE WAL record
+# that is to be applied to the already-removed tablespace.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE db1 WITH TABLESPACE ts1;
+						  DROP DATABASE db1;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# Ensure that tablespace removal doesn't cause error while recovering the
+# preceding alter database set tablespace.
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+$tspDir_master = TestLib::tempdir;
+$realTSDir_master = TestLib::perl2host($tspDir_master);
+mkdir "$realTSDir_master/1";
+mkdir "$realTSDir_master/2";
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master/1'");
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts2 LOCATION '$realTSDir_master/2'");
+
+$tspDir_standby = TestLib::tempdir;
+$realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master/1=$realTSDir_standby/1,$realTSDir_master/2=$realTSDir_standby/2");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown ...
+$node_master->safe_psql('postgres',
+						q[ALTER DATABASE db1 SET TABLESPACE ts2;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
-- 
2.17.2

#23

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Paul Guo (#22)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

22.08.2019 16:13, Paul Guo wrote:

Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.

Hi,
Thank you for working on this fix.
The overall design of the latest version looks good to me.
But during the review, I found a bug in the current implementation.
The new behavior must apply to crash recovery only; right now it applies to
archiveRecovery too.
That can cause a silent loss of a tablespace during regular standby
operation, since it never calls CheckRecoveryConsistency().

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive these changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory, i.e.
/tmp/tblspc2
mkdir /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

6) create another database in tblspc1 on master:
create database db2 tablespace tblspc1 ;

7) resume replication on standby:
select pg_wal_replay_resume();

8) try to connect to db2 on standby

It's expected that dbase_redo() will fail because the directory on
standby is not found.
With the patch, however, the error is suppressed until we attempt to connect
to db2 on the standby:

2019-08-22 18:34:39.178 MSK [21066] HINT: Execute
pg_wal_replay_resume() to continue.
2019-08-22 18:42:41.656 MSK [21066] WARNING: Skip creating database
directory "pg_tblspc/16384/PG_13_201908012". The dest tablespace may
have been removed before abnormal shutdown. If the removal is illegal
after later checking we will panic.
2019-08-22 18:42:41.656 MSK [21066] CONTEXT: WAL redo at 0/3027738 for
Database/CREATE: copy dir base/1 to pg_tblspc/16384/PG_13_201908012/16390
2019-08-22 18:42:46.096 MSK [21688] FATAL:
"pg_tblspc/16384/PG_13_201908012/16390" is not a valid data directory
2019-08-22 18:42:46.096 MSK [21688] DETAIL: File
"pg_tblspc/16384/PG_13_201908012/16390/PG_VERSION" is missing.

Also some nitpicking about code style:
1) Please, add comment to forget_missing_directory().

2) + elog(LOG, "Directory \"%s\" was missing during directory copying "
I think we'd better raise this message's elevel to WARNING.

3) Shouldn't we also move FlushDatabaseBuffers(xlrec->src_db_id); call under
if (do_copydir) clause?
I don't see a reason to flush pages that we won't use later.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#24

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Anastasia Lubennikova (#23)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2019-Aug-22, Anastasia Lubennikova wrote:

22.08.2019 16:13, Paul Guo wrote:

Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.

Hi,
Thank you for working on this fix.
The overall design of the latest version looks good to me.
But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.

Hello

Paul, Kyotaro, are you working on updating this bugfix? FWIW the latest
patch submitted by Paul is still current and CFbot says it passes its
own test, but from Anastasia's email it still needs a bit of work.

Also: it would be good to have this new bogus scenario described by
Anastasia covered by a new TAP test. Anastasia, can we enlist you to
write that? Maybe Kyotaro?

Thanks

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#25

Paul Guo

pguo@pivotal.io

over 6 years ago

In reply to: Alvaro Herrera (#24)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Tue, Sep 3, 2019 at 11:58 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2019-Aug-22, Anastasia Lubennikova wrote:

22.08.2019 16:13, Paul Guo wrote:

Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.

Hi,
Thank you for working on this fix.
The overall design of the latest version looks good to me.
But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.

Hello

Paul, Kyotaro, are you working on updating this bugfix? FWIW the latest
patch submitted by Paul is still current and CFbot says it passes its
own test, but from Anastasia's email it still needs a bit of work.

Also: it would be good to have this new bogus scenario described by
Anastasia covered by a new TAP test. Anastasia, can we enlist you to
write that? Maybe Kyotaro?

Thanks Anastasia and Alvaro for the comments and suggestions. Sorry, I've been
busy working on some non-PG stuff recently. I've never worked on archive
recovery, so I expect to need a bit more time once I'm free (hopefully in
several days) to take a look.
Of course Kyotaro, Anastasia or anyone else should feel free to address the
concern before that.

#26

Asim R P

apraveen@pivotal.io

over 6 years ago

In reply to: Anastasia Lubennikova (#23)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi Anastasia

On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.
That can cause a silent loss of a tablespace during regular standby
operation
since it never calls CheckRecoveryConsistency().

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive these changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory, i.e.
/tmp/tblspc2
mkdir /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

By changing the tablespace symlink target, we are silently nullifying
effects of a committed transaction from the standby data directory - the
directory structure created by the standby for create tablespace
transaction. This step, therefore, does not look like a valid test case to
me. Can you share a sequence of steps that does not involve changing data
directory manually?

Also some nitpicking about code style:
1) Please, add comment to forget_missing_directory().

2) + elog(LOG, "Directory \"%s\" was missing during
directory copying "
I think we'd better update this message elevel to WARNING.

3) Shouldn't we also move FlushDatabaseBuffers(xlrec->src_db_id); call under
if (do_copydir) clause?
I don't see a reason to flush pages that we won't use later.

Thank you for the review feedback. I agree with all the points. Let me
incorporate them (I plan to pick this work up and drive it to completion as
Paul got busy with other things).

But before that I'm revisiting another solution upthread, that of creating
restart points when replaying create/drop database commands before making
filesystem changes such as removing a directory. The restart points should
align with checkpoints on master. The concern with this solution was that
creating restart points would slow down recovery. I don't think crash
recovery is affected by this solution because of the already existing
enforcement of checkpoints. WAL records prior to a create/drop database
will not be seen by crash recovery due to the checkpoint enforced during
the command's normal execution.

Asim

#27

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Asim R P (#26)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

10.09.2019 14:42, Asim R P wrote:

Hi Anastasia

On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru <mailto:a.lubennikova@postgrespro.ru>>
wrote:

But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.
That can cause a silent loss of a tablespace during regular standby
operation
since it never calls CheckRecoveryConsistency().

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive these changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory, i.e.
/tmp/tblspc2
mkdir /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

By changing the tablespace symlink target, we are silently nullifying
effects of a committed transaction from the standby data directory -
the directory structure created by the standby for create tablespace
transaction. This step, therefore, does not look like a valid test
case to me. Can you share a sequence of steps that does not involve
changing data directory manually?

Hi, the whole idea of the test is to reproduce data loss, for example
when the disk containing this tablespace fails.
Probably, simply deleting the directory
'postgresql_data_replica/pg_tblspc/16384'
would work as well, though I was afraid that it could be caught by some
earlier checks and my example wouldn't be so illustrative.

Thank you for the review feedback. I agree with all the points. Let
me incorporate them (I plan to pick this work up and drive it to
completion as Paul got busy with other things).

But before that I'm revisiting another solution upthread, that of
creating restart points when replaying create/drop database commands
before making filesystem changes such as removing a directory. The
restart points should align with checkpoints on master. The concern
against this solution was creation of restart points will slow down
recovery. I don't think crash recovery is affected by this solution
because of the already existing enforcement of checkpoints. WAL
records prior to a create/drop database will not be seen by crash
recovery due to the checkpoint enforced during the command's normal
execution.

I haven't measured the impact of generating extra restart points in
the previous solution, so I cannot tell whether the concerns upthread are
justified. Still, I like the latest design more, since it is clear and
similar to the code that checks for unexpected uninitialized pages. In
principle it works. And the issue I described in the previous review can be
easily fixed by several additional checks of the InHotStandby macro.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#28

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 6 years ago

In reply to: Anastasia Lubennikova (#27)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Wed, 11 Sep 2019 17:26:44 +0300, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote in <a82a896b-93f0-c26c-b941-f5665131381b@postgrespro.ru>

10.09.2019 14:42, Asim R P wrote:

Hi Anastasia

On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru <mailto:a.lubennikova@postgrespro.ru>>
wrote:

But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.
That can cause a silent loss of a tablespace during regular standby
operation
since it never calls CheckRecoveryConsistency().

Yeah. We should take the same steps as redo operations on
missing pages. Just record the failure while in an inconsistent state, then
forget it if the underlying tablespace is gone. If we still have a record
when we reach consistency, we're in a serious situation and
should stop recovery. log_invalid_page, forget_invalid_pages and
CheckRecoveryConsistency are the entry points of the feature to
understand.
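
The record, forget, then check sequence described above can be sketched as follows. This is a minimal standalone model, not PostgreSQL code: the fixed-size array and the names `remember_missing_dir`, `forget_missing_dir` and `unexplained_missing_dirs` are hypothetical and only mirror the roles that log_invalid_page, forget_invalid_pages and the consistency check play for pages.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a list of directories found missing in redo. */
#define MAX_MISSING 64
static char *missing_dirs[MAX_MISSING];
static int   n_missing = 0;

/* Remember a directory that redo expected to exist but did not find. */
static void
remember_missing_dir(const char *dir)
{
    size_t len = strlen(dir) + 1;
    char  *copy;

    assert(n_missing < MAX_MISSING);
    copy = malloc(len);
    assert(copy != NULL);
    memcpy(copy, dir, len);
    missing_dirs[n_missing++] = copy;
}

/* A later drop record explains the absence, so remove the entry. */
static void
forget_missing_dir(const char *dir)
{
    for (int i = 0; i < n_missing; i++)
    {
        if (strcmp(missing_dirs[i], dir) == 0)
        {
            free(missing_dirs[i]);
            /* Fill the hole with the last entry; order does not matter. */
            missing_dirs[i] = missing_dirs[--n_missing];
            return;
        }
    }
}

/* At the consistency point: a nonzero count means recovery should stop. */
static int
unexplained_missing_dirs(void)
{
    return n_missing;
}
```

Entries that are still present when consistency is reached are the ones no later WAL record accounted for, which is the case where stopping recovery is the safe choice.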

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive these changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory,
i.e. /tmp/tblspc2
mkdir /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

By changing the tablespace symlink target, we are silently nullifying
effects of a committed transaction from the standby data directory -
the directory structure created by the standby for create tablespace
transaction. This step, therefore, does not look like a valid test
case to me. Can you share a sequence of steps that does not involve
changing data directory manually?

I see it the same way. With those steps, WAL is inconsistent with what
happened on storage. The database is just broken.

Hi, the whole idea of the test is to reproduce a data loss. For
example, if the disk containing this tablespace failed.

So, apparently we must start recovery from a backup taken before that
failure happened in that case, and that should end in success.

# I remember that the start point of this patch is a crash after
# table space drop subsequent to several operations within the
# table space. Then, crash recovery fails at an operation in the
# finally-removed tablespace. Is it right?

Probably, simply deleting the directory
'postgresql_data_replica/pg_tblspc/16384'
would work as well, though I was afraid that it can be caught by some
earlier checks and my example won't be so illustrative.

Thank you for the review feedback. I agree with all the points. Let
me incorporate them (I plan to pick this work up and drive it to
completion as Paul got busy with other things).

But before that I'm revisiting another solution upthread, that of
creating restart points when replaying create/drop database commands
before making filesystem changes such as removing a directory. The
restart points should align with checkpoints on master. The concern
against this solution was creation of restart points will slow down
recovery. I don't think crash recovery is affected by this solution
because of the already existing enforcement of checkpoints. WAL
records prior to a create/drop database will not be seen by crash
recovery due to the checkpoint enforced during the command's normal
execution.

I haven't measured the impact of generating extra restart points in
previous solution, so I cannot tell whether concerns upthread are
justified. Still, I enjoy latest design more, since it is clear and
similar with the code of checking unexpected uninitialized pages. In
principle it works. And the issue, I described in previous review can
be easily fixed by several additional checks of InHotStandby macro.

Generally we shouldn't trigger useless restart point for
uncertain reasons. If standby crashes, it starts the next
recovery from the latest *restart point*. Even in that case what
we should do is the same.

Of course, for testing, we *should* establish a restartpoint
manually in order to establish the prerequisite state.
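
As a toy model of why the replay start point matters, the sketch below replays a three-record sequence against a tiny in-memory filesystem state. Everything in it is hypothetical (`RecType`, `FsState`, `replay_from`); it only assumes that replaying a database create needs the tablespace directory and that replaying a drop tolerates an already-missing directory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record types for the sequence discussed in this thread. */
typedef enum
{
    REC_CREATE_DB,
    REC_DROP_DB,
    REC_DROP_TABLESPACE
} RecType;

/* Minimal model of the on-disk state that replay depends on. */
typedef struct
{
    bool tablespace_exists;
    bool db_exists;
} FsState;

/*
 * Replay records[start .. n-1].  Returns the index of the first record that
 * cannot be applied, or -1 if replay succeeds.
 */
static int
replay_from(const RecType *records, int n, int start, FsState *fs)
{
    assert(start >= 0 && start <= n);
    for (int i = start; i < n; i++)
    {
        switch (records[i])
        {
            case REC_CREATE_DB:
                if (!fs->tablespace_exists)
                    return i;   /* parent directory is gone */
                fs->db_exists = true;
                break;
            case REC_DROP_DB:
                fs->db_exists = false;  /* missing directory is tolerated */
                break;
            case REC_DROP_TABLESPACE:
                fs->tablespace_exists = false;
                break;
        }
    }
    return -1;
}
```

In this model the first pass over create, drop database, drop tablespace succeeds, a second pass from the same start point fails at the create record, and a pass that starts after the create record succeeds again.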

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#29

Asim R P

apraveen@pivotal.io

over 6 years ago

In reply to: Kyotaro Horiguchi (#28)

2 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Thu, Sep 12, 2019 at 2:05 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

Hello.

At Wed, 11 Sep 2019 17:26:44 +0300, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote in <a82a896b-93f0-c26c-b941-f5665131381b@postgrespro.ru>

10.09.2019 14:42, Asim R P wrote:

Hi Anastasia

On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru <mailto:a.lubennikova@postgrespro.ru>>
wrote:

But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.
That can cause a silent loss of a tablespace during regular standby
operation
since it never calls CheckRecoveryConsistency().

Yeah. We should take the same steps with redo operations on
missing pages. Just record failure during inconsistent state then
forget it if underlying tablespace is gone. If we had a record
when we reached consistency, we're in a serious situation and
should stop recovery. log_invalid_page forget_invalid_pages and
CheckRecoveryConsistency are the entry points of the feature to
understand.

Yes, I get it now. I will adjust the patch written by Paul accordingly.

# I remember that the start point of this patch is a crash after
# table space drop subsequent to several operations within the
# table space. Then, crash recovery fails at an operation in the
# finally-removed tablespace. Is it right?

That's correct. Once the directories are removed from filesystem, any
attempt to replay WAL records that depend on their existence fails.

But before that I'm revisiting another solution upthread, that of
creating restart points when replaying create/drop database commands
before making filesystem changes such as removing a directory. The
restart points should align with checkpoints on master. The concern
against this solution was creation of restart points will slow down
recovery. I don't think crash recovery is affected by this solution
because of the already existing enforcement of checkpoints. WAL
records prior to a create/drop database will not be seen by crash
recovery due to the checkpoint enforced during the command's normal
execution.

I haven't measured the impact of generating extra restart points in
previous solution, so I cannot tell whether concerns upthread are
justified. Still, I enjoy latest design more, since it is clear and
similar with the code of checking unexpected uninitialized pages. In
principle it works. And the issue, I described in previous review can
be easily fixed by several additional checks of InHotStandby macro.

Generally we shouldn't trigger useless restart point for
uncertain reasons. If standby crashes, it starts the next
recovery from the latest *restart point*. Even in that case what
we should do is the same.

The reason is quite clear to me: removing directories from the filesystem
breaks the ability to replay WAL records a second time. And we already create
checkpoints during normal operation in such a case, so crash recovery on a
master node does not suffer from this bug. I've attached a patch that
performs restart points during drop database replay, just for reference.
It passes both the TAP tests written by Kyotaro and Paul. I had to modify
the drop database WAL record a bit.
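
For readers unfamiliar with variable-length records, here is a minimal sketch of a record that carries a list of tablespace OIDs in a C99 flexible array member. The struct layout is an assumption: the field names follow their usage in the attached patch, but `sketch_dbase_drop_rec`, `drop_rec_size` and `make_drop_rec` are hypothetical, and plain malloc() stands in for palloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid type */

/* Hypothetical layout of a drop record covering several tablespaces. */
typedef struct sketch_dbase_drop_rec
{
    Oid db_id;
    int nspcids;            /* number of entries in tablespace_ids */
    Oid tablespace_ids[];   /* flexible array member */
} sketch_dbase_drop_rec;

/* Size in bytes of a record holding nspcids tablespace OIDs. */
static size_t
drop_rec_size(int nspcids)
{
    assert(nspcids >= 0);
    return offsetof(sketch_dbase_drop_rec, tablespace_ids)
        + sizeof(Oid) * (size_t) nspcids;
}

/* Allocate and fill a record; the caller frees it.  NULL on failure. */
static sketch_dbase_drop_rec *
make_drop_rec(Oid db_id, const Oid *spcoids, int nspcids)
{
    sketch_dbase_drop_rec *rec = malloc(drop_rec_size(nspcids));

    if (rec == NULL)
        return NULL;
    rec->db_id = db_id;
    rec->nspcids = nspcids;
    for (int i = 0; i < nspcids; i++)
        rec->tablespace_ids[i] = spcoids[i];
    return rec;
}
```

Because such a record lives on the heap, whatever call consumes it needs the record pointer itself and the computed size, rather than the address of the pointer variable and sizeof the struct.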

Asim

Attachments:

v1-0001-Create-restartpoint-when-replaying-drop-database.patch (application/octet-stream)

From bd98ac1a2e98bc67238103b0f2764cf7fe0edc58 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Thu, 12 Sep 2019 17:17:29 +0530
Subject: [PATCH v1 1/2] Create restartpoint when replaying drop database

Drop database replay involves removing the database directory.  We do
not have a mechanism similar to invalid page detection for directories
during WAL replay.  If, due to a crash, WAL replay must resume from a
checkpoint, we should avoid replaying a second time the WAL records that
precede the drop database.

Proposed by Paul Guo.
---
 src/backend/access/rmgrdesc/dbasedesc.c |  6 ++-
 src/backend/commands/dbcommands.c       | 84 +++++++++++++++++++++------------
 src/include/access/xlog.h               |  1 -
 src/include/commands/dbcommands_xlog.h  |  3 +-
 4 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..b97bac2411 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -36,8 +36,10 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
-		appendStringInfo(buf, "dir %u/%u",
-						 xlrec->tablespace_id, xlrec->db_id);
+		int i;
+		for (i = 0; i < xlrec->nspcids; i++)
+			appendStringInfo(buf, "\ndir %u/%u",
+							 xlrec->db_id, xlrec->tablespace_ids[i]);
 	}
 }
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 95881a8550..c0c8726698 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1400,16 +1400,19 @@ movedb(const char *dbname, const char *tblspcname)
 	 * Record the filesystem change in XLOG
 	 */
 	{
-		xl_dbase_drop_rec xlrec;
+		size_t xlrec_size = sizeof(xl_dbase_drop_rec) + sizeof(Oid);
+		xl_dbase_drop_rec *xlrec = palloc(xlrec_size);
 
-		xlrec.db_id = db_id;
-		xlrec.tablespace_id = src_tblspcoid;
+		xlrec->db_id = db_id;
+		xlrec->nspcids = 1;
+		xlrec->tablespace_ids[0] = src_tblspcoid;
 
 		XLogBeginInsert();
-		XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
+		XLogRegisterData((char *) xlrec, xlrec_size);
 
 		(void) XLogInsert(RM_DBASE_ID,
 						  XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
+		pfree(xlrec);
 	}
 
 	/* Now it's safe to release the database lock */
@@ -1914,6 +1917,8 @@ remove_dbtablespaces(Oid db_id)
 	Relation	rel;
 	TableScanDesc scan;
 	HeapTuple	tuple;
+	List *dstpaths = NIL;
+	List *spcoids = NIL;
 
 	rel = table_open(TableSpaceRelationId, AccessShareLock);
 	scan = table_beginscan_catalog(rel, 0, NULL);
@@ -1936,31 +1941,39 @@ remove_dbtablespaces(Oid db_id)
 			pfree(dstpath);
 			continue;
 		}
+		dstpaths = lappend(dstpaths, dstpath);
+		spcoids = lappend_oid(spcoids, dsttablespace);
+	}
+
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	size_t xlrec_size = (sizeof(xl_dbase_drop_rec) +
+						 sizeof(Oid)*list_length(spcoids));
+	xl_dbase_drop_rec *xlrec = palloc(xlrec_size);
+	xlrec->db_id = db_id;
+	xlrec->nspcids = list_length(spcoids);
 
-		if (!rmtree(dstpath, true))
+	int i=0;
+	const ListCell *cell1, *cell2;
+	forboth(cell1, dstpaths, cell2, spcoids)
+	{
+		char *path = lfirst(cell1);
+		if (!rmtree(path, true))
 			ereport(WARNING,
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
-							dstpath)));
-
+							path)));
+		pfree(path);
 		/* Record the filesystem change in XLOG */
-		{
-			xl_dbase_drop_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dsttablespace;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
-		}
-
-		pfree(dstpath);
+		xlrec->tablespace_ids[i++] = lfirst_oid(cell2);
 	}
 
-	table_endscan(scan);
-	table_close(rel, AccessShareLock);
+	XLogBeginInsert();
+	XLogRegisterData((char *) xlrec, xlrec_size);
+	(void) XLogInsert(RM_DBASE_ID,
+					  XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
+
+	pfree(xlrec);
 }
 
 /*
@@ -2166,8 +2179,6 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
 		char	   *dst_path;
 
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
-
 		if (InHotStandby)
 		{
 			/*
@@ -2196,11 +2207,26 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
+		/*
+		 * If we crash after removing directories, we should avoid replaying
+		 * WAL records prior to the current WAL record (drop database).
+		 * Creating a restartpoint ensures that recovery will start from at
+		 * least this point onwards in the event of a crash / immediate
+		 * shutdown.
+		 */
+		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+						  CHECKPOINT_WAIT);
+
 		/* And remove the physical files */
-		if (!rmtree(dst_path, true))
-			ereport(WARNING,
-					(errmsg("some useless files may be left behind in old database directory \"%s\"",
-							dst_path)));
+		int i;
+		for (i = 0; i < xlrec->nspcids; i++)
+		{
+			dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			if (!rmtree(dst_path, true))
+				ereport(WARNING,
+						(errmsg("some useless files may be left behind in old database directory \"%s\"",
+								dst_path)));
+		}
 
 		if (InHotStandby)
 		{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..d0582a726b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -221,7 +221,6 @@ extern bool XLOG_DEBUG;
 /* These indicate the cause of a checkpoint request */
 #define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
-
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
  */
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 46be8a615a..a1654f25e0 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -34,7 +34,8 @@ typedef struct xl_dbase_drop_rec
 {
 	/* Records dropping of a single subdirectory incl. contents */
 	Oid			db_id;
-	Oid			tablespace_id;
+	uint16		nspcids;
+	Oid			tablespace_ids[0];
 } xl_dbase_drop_rec;
 
 extern void dbase_redo(XLogReaderState *rptr);
-- 
2.14.3 (Apple Git-98)

v1-0002-Test-to-validate-replay-of-WAL-records-involving-.patch (application/octet-stream)

From 7c311632576bc8f48ac43a9dd0fa07e86e6b47e6 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Thu, 12 Sep 2019 17:19:45 +0530
Subject: [PATCH v1 2/2] Test to validate replay of WAL records involving drop
 database

Authored by Paul Guo and Kyotaro Horiguchi.
---
 src/test/perl/PostgresNode.pm             | 13 +++-
 src/test/perl/RecursiveCopy.pm            | 33 +++++++++--
 src/test/recovery/t/011_crash_recovery.pl | 99 ++++++++++++++++++++++++++++++-
 3 files changed, 138 insertions(+), 7 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..b5dc3a8918 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..514ed90ae7 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::perl2host($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..5ac7303dd6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,100 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# Ensure that tablespace removal doesn't cause error while recovering
+# the preceding create database with that tablespace.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::perl2host($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master=$realTSDir_standby");
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This leaves a CREATE DATABASE WAL record
+# that is to be applied to an already-removed tablespace.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE db1 WITH TABLESPACE ts1;
+						  DROP DATABASE db1;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# Ensure that tablespace removal doesn't cause error while recovering the
+# preceding alter database set tablespace.
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+$tspDir_master = TestLib::tempdir;
+$realTSDir_master = TestLib::perl2host($tspDir_master);
+mkdir "$realTSDir_master/1";
+mkdir "$realTSDir_master/2";
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master/1'");
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts2 LOCATION '$realTSDir_master/2'");
+
+$tspDir_standby = TestLib::tempdir;
+$realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master/1=$realTSDir_standby/1,$realTSDir_master/2=$realTSDir_standby/2");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown ...
+$node_master->safe_psql('postgres',
+						q[ALTER DATABASE db1 SET TABLESPACE ts2;
+						  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
-- 
2.14.3 (Apple Git-98)

#30

Asim R P

apraveen@pivotal.io

over 6 years ago

In reply to: Paul Guo (#22)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Thu, Aug 22, 2019 at 6:44 PM Paul Guo <pguo@pivotal.io> wrote:

Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.

This patch contains one more bug, in addition to what Anastasia has found.
If the test case in the patch is tweaked slightly, as follows, the standby
crashes due to PANIC.

--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -147,8 +147,6 @@ $node_standby->start;
 $node_master->poll_query_until(
        'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
 
-$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
-
 # Make sure to perform restartpoint after tablespace creation
 $node_master->wait_for_catchup($node_standby, 'replay',
                                $node_master->lsn('replay'));
@@ -156,7 +154,8 @@ $node_standby->safe_psql('postgres', 'CHECKPOINT');
 
 # Do immediate shutdown ...
 $node_master->safe_psql('postgres',
-                        q[ALTER DATABASE db1 SET TABLESPACE ts2;
+                        q[CREATE DATABASE db1 TABLESPACE ts1;
+                          ALTER DATABASE db1 SET TABLESPACE ts2;
                           DROP TABLESPACE ts1;]);
 $node_master->wait_for_catchup($node_standby, 'replay',
                                $node_master->lsn('replay'));

Notice the additional create database in the above change. That
causes the same tablespace directory (ts1) to be logged twice in the list of
missing directories. At the end of crash recovery, there is one unmatched
entry in the missing dirs list and the standby PANICs.
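
The failure mode can be illustrated with a small standalone model. This is only a sketch under assumptions: it assumes the missing-directory list appends an entry on every reference without deduplication and that a later drop record removes a single matching entry. The names missing_list, log_missing, forget_first and replay_duplicate_scenario are hypothetical and do not come from the v5 patch, and the OIDs are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

#define MAX_MISSING 16			/* arbitrary bound for the sketch */

typedef struct
{
	Oid			spc;
	Oid			db;
} missing_dir;

typedef struct
{
	missing_dir entries[MAX_MISSING];
	size_t		n;
} missing_list;

/* Append unconditionally: a repeated reference becomes a second entry. */
void
log_missing(missing_list *l, Oid spc, Oid db)
{
	assert(l->n < MAX_MISSING);
	l->entries[l->n].spc = spc;
	l->entries[l->n].db = db;
	l->n++;
}

/* Remove only the first matching entry; returns 1 if one was removed. */
int
forget_first(missing_list *l, Oid spc, Oid db)
{
	for (size_t i = 0; i < l->n; i++)
	{
		if (l->entries[i].spc == spc && l->entries[i].db == db)
		{
			l->entries[i] = l->entries[l->n - 1];
			l->n--;
			return 1;
		}
	}
	return 0;
}

/* Two records reference the same missing directory, one record clears it. */
size_t
replay_duplicate_scenario(void)
{
	missing_list l = {.n = 0};

	log_missing(&l, 65546, 65547);	/* create database: directory missing */
	log_missing(&l, 65546, 65547);	/* set tablespace: same directory again */
	forget_first(&l, 65546, 65547); /* the one matching drop record */

	return l.n;					/* entries left at end of recovery */
}
```

In this model one entry is still present when recovery ends, which corresponds to the unmatched entry that triggers the PANIC.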

Please find attached a couple of tests that are built on top of what was
already written by Paul and Kyotaro. The patch includes a test to demonstrate
the above-mentioned failure and a test case that my friend Alexandra wrote
to implement the archive recovery scenario noted by Anastasia.

In order to fix the test failures, we need to distinguish between a missing
database directory and a missing tablespace directory. We also need to add
logic to forget missing directories during tablespace drop. I am working on it.

Asim

Attachments:

0001-Tests-for-replay-of-create-database-operation-on-sta.patch (application/octet-stream)

From 26b34d171d2bb185a2d927b88a77a4b0cacb0c88 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Thu, 19 Sep 2019 17:14:55 +0530
Subject: [PATCH] Tests for replay of create database operation on standby

A couple of tests to demonstrate that a standby fails to replay a create
database WAL record during crash recovery, if one or more of the underlying
directories are missing from the file system.  This can happen if a drop
tablespace or drop database WAL record has been replayed in archive
recovery, before a crash, and the create database record then happens
to be replayed again during crash recovery.  The failures indicate bugs
that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying a create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra, Anastasia, Kyotaro, Paul and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 150 +++++++++++++++++++++++++++++-
 1 file changed, 149 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..1cea17c7d4 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,6 +6,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 if ($Config{osname} eq 'MSWin32')
 {
@@ -15,7 +16,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +67,150 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    tablespace.
+#
+# 2. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 3. Create a database from another database as template then drop the
+#    template database.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master = TestLib::tempdir;
+$dropme_ts_master = TestLib::perl2host($dropme_ts_master);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts location '$dropme_ts_master';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby = TestLib::tempdir;
+$dropme_ts_standby = TestLib::perl2host($dropme_ts_standby);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = "$dropme_ts_master=$dropme_ts_standby," .
+  "$source_ts_master=$source_ts_standby," .
+  "$target_ts_master=$target_ts_standby";
+$node_master->backup($backup_name, tablespace_mappings => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  DROP DATABASE dropme_db;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP TABLESPACE source_ts;
+						  DROP TABLESPACE dropme_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master4');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::perl2host($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::perl2host($tspDir_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$realTSDir_master=$realTSDir_standby");
+$node_standby = get_new_node('standby4');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# XXX For some reason, the tablespace mapping is not honored and the
+# standby ends up getting a different temp dir than what was specified
+# in the tablespace mapping.  So get the tablespace directory by
+# querying standby.
+$realTSDir_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($realTSDir_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until_params(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		timeout => 5) == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.14.3 (Apple Git-98)

#31

Asim R P

apraveen@pivotal.io

over 6 years ago

In reply to: Asim R P (#30)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Thu, Sep 19, 2019 at 5:29 PM Asim R P <apraveen@pivotal.io> wrote:

In order to fix the test failures, we need to distinguish between a
missing database directory and a missing tablespace directory. And also
add logic to forget missing directories during tablespace drop. I am
working on it.

Please find attached a solution that builds on what Paul has proposed. A
hash table, similar to the invalid page hash table, is used to track missing
directory references. A missing directory may be a tablespace or a
database, based on whether the tablespace is found missing or the source
database is found missing. Crash recovery succeeds if the hash table
is empty at the end.
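
The bookkeeping described here can be sketched as a keyed set: recording the same directory twice is a no-op, a drop database record forgets one entry, a drop tablespace record forgets every entry under that tablespace, and recovery is considered successful only if nothing is left. This is a minimal standalone sketch, not the attached patch. A small linear array stands in for the real hash table, and the names missing_dirs, track_missing, forget_database_dir, forget_tablespace and all_resolved are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

#define MAX_TRACKED 32			/* arbitrary bound for the sketch */

typedef struct
{
	Oid			spc;
	Oid			db;
} dir_key;

typedef struct
{
	dir_key		keys[MAX_TRACKED];
	size_t		n;
} missing_dirs;

/* Record a missing directory; inserting an existing key is a no-op. */
void
track_missing(missing_dirs *t, Oid spc, Oid db)
{
	for (size_t i = 0; i < t->n; i++)
	{
		if (t->keys[i].spc == spc && t->keys[i].db == db)
			return;
	}
	assert(t->n < MAX_TRACKED);
	t->keys[t->n].spc = spc;
	t->keys[t->n].db = db;
	t->n++;
}

/* Drop database replay: forget one (tablespace, database) directory. */
void
forget_database_dir(missing_dirs *t, Oid spc, Oid db)
{
	for (size_t i = 0; i < t->n; i++)
	{
		if (t->keys[i].spc == spc && t->keys[i].db == db)
		{
			t->keys[i] = t->keys[t->n - 1];
			t->n--;
			return;
		}
	}
}

/* Drop tablespace replay: forget every directory under that tablespace. */
void
forget_tablespace(missing_dirs *t, Oid spc)
{
	size_t		i = 0;

	while (i < t->n)
	{
		if (t->keys[i].spc == spc)
		{
			/* Move the last key into this slot and re-check the slot. */
			t->keys[i] = t->keys[t->n - 1];
			t->n--;
		}
		else
			i++;
	}
}

/* End of crash recovery: succeed only if nothing is still missing. */
bool
all_resolved(const missing_dirs *t)
{
	return t->n == 0;
}
```

With this shape, a directory referenced by both a create database and a later set tablespace record occupies a single entry, so one matching drop record is enough to clear it.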

Asim

Attachments:

v6-0001-Support-node-initialization-from-backup-with-tabl.patch (application/octet-stream)

From f676cd4c9d698b0f9e4f86d6251253ac6ddda36f Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v6 1/3] Support node initialization from backup with
 tablespaces

User-defined tablespaces appear as symlinks in the backup.  This
commit tweaks the recursive copy subroutine to allow for symlinks
specific to tablespaces.

Authored by Kyotaro
---
 src/test/perl/RecursiveCopy.pm | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..514ed90ae7 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::perl2host($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
-- 
2.14.3 (Apple Git-98)

v6-0002-Tests-to-replay-create-database-operation-on-stan.patch (application/octet-stream)

From 45f2eb5ddd2c819942cc299e97eb017cd04b5181 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v6 2/3] Tests to replay create database operation on standby

The tests demonstrate that a standby fails to replay a create database
WAL record during crash recovery, if one or more of the underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash, and the create database record
then happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying a create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra, Anastasia, Kyotaro, Paul and me.
---
 src/test/perl/PostgresNode.pm             |  31 +++++-
 src/test/recovery/t/011_crash_recovery.pl | 152 +++++++++++++++++++++++++++++-
 2 files changed, 177 insertions(+), 6 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..959fa24bba 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
@@ -1550,9 +1559,21 @@ sub poll_query_until
 
 	$expected = 't' unless defined($expected);    # default value
 
+	$self->poll_query_until_params(
+		$dbname, $query,
+		expected => $expected, timeout => 180);
+}
+
+sub poll_query_until_params
+{
+	my ($self, $dbname, $query, %params) = @_;
+
+	$params{expected} = 't' unless defined($params{expected});
+	$params{timeout} = 180 unless defined($params{timeout});
+
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -1562,7 +1583,7 @@ sub poll_query_until
 		chomp($stdout);
 		$stdout =~ s/\r//g if $TestLib::windows_os;
 
-		if ($stdout eq $expected)
+		if ($stdout eq $params{expected})
 		{
 			return 1;
 		}
@@ -1580,7 +1601,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..23397232b8 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,6 +6,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 if ($Config{osname} eq 'MSWin32')
 {
@@ -15,7 +16,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +67,152 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    tablespace.
+#
+# 2. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 3. Create a database from another database as template then drop the
+#    template database.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master = TestLib::tempdir;
+$dropme_ts_master = TestLib::perl2host($dropme_ts_master);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts location '$dropme_ts_master';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby = TestLib::tempdir;
+$dropme_ts_standby = TestLib::perl2host($dropme_ts_standby);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = "$dropme_ts_master=$dropme_ts_standby," .
+  "$source_ts_master=$source_ts_standby," .
+  "$target_ts_master=$target_ts_standby";
+$node_master->backup($backup_name, tablespace_mappings => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  DROP DATABASE dropme_db1;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE dropme_db2;
+						  DROP TABLESPACE dropme_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$ts_master=$ts_standby");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until_params(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		timeout => 5) == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.14.3 (Apple Git-98)

v6-0003-Fix-replay-of-create-database-records-on-standby.patch (application/octet-stream)

From ee35455705ac7ea81b5db6e090591971d277c5a8 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:43:25 +0530
Subject: [PATCH v6 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul.

Authored by Paul, Kyotaro and me.

Discussion: CAEET0ZGx9AvioViLf7nbR_8tH9-%3D27DN5xWJ2P9-ROH16e4JUA%40mail.gmail.com
---
 src/backend/access/rmgrdesc/dbasedesc.c |  14 ++--
 src/backend/access/transam/xlog.c       |   6 ++
 src/backend/access/transam/xlogutils.c  | 130 ++++++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c       |  53 +++++++++++++
 src/backend/commands/tablespace.c       |   3 +
 src/include/access/xlogutils.h          |   4 +
 src/include/commands/dbcommands.h       |   2 +
 7 files changed, 207 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..35092ffb0e 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,21 +23,25 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char	   *dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 
-		appendStringInfo(buf, "dir %u/%u",
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "dir %s", dbpath1);
+		pfree(dbpath1);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b7ff004234..749e8d5961 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7858,6 +7858,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 1fc39333f1..e1384afe30 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -56,6 +56,136 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (reachedConsistency)
+		elog(PANIC, "cannot find directory %s tablespace %u database %u",
+			 path, spcNode, dbNode);
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+		elog(DEBUG2, "missing directory %s tablespace %u database %u already exists: %s",
+			 path, spcNode, dbNode, entry->path);
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		elog(DEBUG2, "logged missing dir %s tablespace %u database %u",
+			 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+		elog(DEBUG2, "dir %s tablespace %u database %u is not missing",
+			 path, spcNode, dbNode);
+	else
+		elog(DEBUG2, "forgot missing dir %s for tablespace %u database %u",
+			 path, spcNode, dbNode);
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 95881a8550..5388800660 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,6 +45,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2129,7 +2130,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2147,6 +2150,54 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace to another
+		 * may also leave the source directory missing.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2202,6 +2253,8 @@ dbase_redo(XLogReaderState *record)
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
 
+		XLogForgetMissingDir(xlrec->tablespace_id, xlrec->db_id, dst_path);
+
 		if (InHotStandby)
 		{
 			/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 84efb414d8..0d553c39c8 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1517,6 +1518,8 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 4105b59904..7034b691bd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 154c8157ee..26e96b8957 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,6 +19,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"
 
+extern void CheckMissingDirs4DbaseRedo(void);
+
 extern Oid	createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok);
 extern ObjectAddress RenameDatabase(const char *oldname, const char *newname);
-- 
2.14.3 (Apple Git-98)

#32

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

about 6 years ago

In reply to: Asim R P (#31)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

20.09.2019 15:23, Asim R P wrote:

On Thu, Sep 19, 2019 at 5:29 PM Asim R P <apraveen@pivotal.io> wrote:

In order to fix the test failures, we need to distinguish between a
missing database directory and a missing tablespace directory. And
also add logic to forget missing directories during tablespace drop.
I am working on it.

Please find attached a solution that builds on what Paul has proposed.
A hash table, similar to the invalid page hash table, is used to track
missing directory references. A missing directory may be a tablespace
or a database, based on whether the tablespace is found missing or the
source database is found missing. The crash recovery succeeds if the
hash table is empty at the end.
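
For readers following along, the bookkeeping described above can be sketched as a small standalone C fragment. This is only an illustration under simplifying assumptions: it uses a fixed-size array rather than PostgreSQL's dynahash, plain `unsigned` rather than `Oid`, and the names `log_missing_dir`, `forget_missing_dir` and `count_unresolved_missing_dirs` are hypothetical, not the functions in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_MISSING_DIRS 64     /* hypothetical fixed capacity */
#define MISSING_PATH_LEN 256
#define INVALID_OID 0u          /* stands in for InvalidOid */

/* One tracked reference to a directory that was missing during replay. */
typedef struct MissingDir
{
    bool     used;
    unsigned spc;               /* tablespace OID, must be valid */
    unsigned db;                /* database OID, INVALID_OID for a tablespace dir */
    char     path[MISSING_PATH_LEN];
} MissingDir;

static MissingDir missing_dirs[MAX_MISSING_DIRS];

/* Linear lookup by (tablespace, database) key. */
static MissingDir *
find_missing_dir(unsigned spc, unsigned db)
{
    for (size_t i = 0; i < MAX_MISSING_DIRS; i++)
    {
        MissingDir *e = &missing_dirs[i];

        if (e->used && e->spc == spc && e->db == db)
            return e;
    }
    return NULL;
}

/* Remember a missing directory.  Returns false only if the table is full. */
bool
log_missing_dir(unsigned spc, unsigned db, const char *path)
{
    assert(spc != INVALID_OID);

    if (find_missing_dir(spc, db) != NULL)
        return true;            /* already tracked, nothing to add */

    for (size_t i = 0; i < MAX_MISSING_DIRS; i++)
    {
        MissingDir *e = &missing_dirs[i];

        if (!e->used)
        {
            e->used = true;
            e->spc = spc;
            e->db = db;
            snprintf(e->path, sizeof(e->path), "%s", path);
            return true;
        }
    }
    return false;
}

/* Called when a drop record is replayed.  Returns true if an entry was removed. */
bool
forget_missing_dir(unsigned spc, unsigned db)
{
    MissingDir *e = find_missing_dir(spc, db);

    if (e == NULL)
        return false;
    e->used = false;
    return true;
}

/* Report and count the references that no drop record has resolved. */
int
count_unresolved_missing_dirs(void)
{
    int n = 0;

    for (size_t i = 0; i < MAX_MISSING_DIRS; i++)
    {
        const MissingDir *e = &missing_dirs[i];

        if (e->used)
        {
            fprintf(stderr, "missing directory \"%s\" tablespace %u database %u\n",
                    e->path, e->spc, e->db);
            n++;
        }
    }
    return n;
}
```

In this sketch a replay loop would call `log_missing_dir` when a create database record finds a directory missing, call `forget_missing_dir` when the matching drop record is replayed, and treat a non-zero result from `count_unresolved_missing_dirs` at the consistency point as fatal, which is where the patch raises PANIC.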

The v6-0003 patch had merge conflicts due to the recent
'xl_dbase_drop_rec' change, so I rebased it.
See v7-0003 in the attachment. The changes are pretty straightforward,
though it would be great if you could check them once more.

Newly introduced test 4 in t/011_crash_recovery.pl fails without the
patch and passes with it.
It seems to me that everything is fine, so I mark it "Ready For Committer".

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v6-0001-Support-node-initialization-from-backup-with-tabl.patch (text/x-patch)

From f676cd4c9d698b0f9e4f86d6251253ac6ddda36f Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v6 1/3] Support node initialization from backup with
 tablespaces

User-defined tablespaces appear as symlinks in the backup.  This
commit tweaks recursive copy subroutine to allow for symlinks specific
to tablespaces.

Authored by Kyotaro
---
 src/test/perl/RecursiveCopy.pm | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..514ed90ae7 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# (note: this will fall through quietly if file is already gone)
+	if (-l $srcpath)
+	{
+		croak "Cannot operate on symlink \"$srcpath\""
+		  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+		# We have mapped tablespaces. Copy them individually
+		my $linkname = $1;
+		my $tmpdir = TestLib::tempdir;
+		my $dstrealdir = TestLib::perl2host($tmpdir);
+		my $srcrealdir = readlink($srcpath);
+
+		opendir(my $dh, $srcrealdir);
+		while (readdir $dh)
+		{
+			next if (/^\.\.?$/);
+			my $spath = "$srcrealdir/$_";
+			my $dpath = "$dstrealdir/$_";
+
+			copypath($spath, $dpath);
+		}
+		closedir $dh;
+
+		symlink $dstrealdir, $destpath;
+		return 1;
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
-- 
2.14.3 (Apple Git-98)

v6-0002-Tests-to-replay-create-database-operation-on-stan.patch (text/x-patch)

From 45f2eb5ddd2c819942cc299e97eb017cd04b5181 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v6 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightly aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra, Anastasia, Kyotaro, Paul and me.
---
 src/test/perl/PostgresNode.pm             |  31 +++++-
 src/test/recovery/t/011_crash_recovery.pl | 152 +++++++++++++++++++++++++++++-
 2 files changed, 177 insertions(+), 6 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..959fa24bba 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
@@ -1550,9 +1559,21 @@ sub poll_query_until
 
 	$expected = 't' unless defined($expected);    # default value
 
+	$self->poll_query_until_params(
+		$dbname, $query,
+		expected => $expected, timeout => 180);
+}
+
+sub poll_query_until_params
+{
+	my ($self, $dbname, $query, %params) = @_;
+
+	$params{expected} = 't' unless defined($params{expected});
+	$params{timeout} = 180 unless defined($params{timeout});
+
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -1562,7 +1583,7 @@ sub poll_query_until
 		chomp($stdout);
 		$stdout =~ s/\r//g if $TestLib::windows_os;
 
-		if ($stdout eq $expected)
+		if ($stdout eq $params{expected})
 		{
 			return 1;
 		}
@@ -1580,7 +1601,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..23397232b8 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,6 +6,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 if ($Config{osname} eq 'MSWin32')
 {
@@ -15,7 +16,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +67,152 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    tablespace.
+#
+# 2. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 3. Create a database from another database as template then drop the
+#    template database.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master = TestLib::tempdir;
+$dropme_ts_master = TestLib::perl2host($dropme_ts_master);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts location '$dropme_ts_master';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby = TestLib::tempdir;
+$dropme_ts_standby = TestLib::perl2host($dropme_ts_standby);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = "$dropme_ts_master=$dropme_ts_standby," .
+  "$source_ts_master=$source_ts_standby," .
+  "$target_ts_master=$target_ts_standby";
+$node_master->backup($backup_name, tablespace_mappings => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  DROP DATABASE dropme_db1;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE dropme_db2;
+						  DROP TABLESPACE dropme_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$ts_master=$ts_standby");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until_params(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		timeout => 5) == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.14.3 (Apple Git-98)

v7-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch)

commit c1aa5f67df052467dd6d67cdf5dbd5388ecefb1f
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Thu Dec 26 17:30:33 2019 +0300

    Subject: [PATCH v7 3/3] Fix replay of create database records on standby
    
    Crash recovery on standby may encounter missing directories when
    replaying create database WAL records.  Prior to this patch, the
    standby would fail to recover in such a case.  However, the
    directories could be legitimately missing.  Consider a sequence of WAL
    records as follows:
    
        CREATE DATABASE
        DROP DATABASE
        DROP TABLESPACE
    
    If, after replaying the last WAL record and removing the tablespace
    directory, the standby crashes and has to replay the create database
    record again, the crash recovery must be able to move on.
    
    This patch adds a mechanism, similar to the invalid page hash table, to track
    missing directories during crash recovery.  If all the missing
    directory references are matched with corresponding drop records at
    the end of crash recovery, the standby can safely enter archive
    recovery.
    
    Bug identified by Paul.
    
    Authored by Paul, Kyotaro and Asim R P.

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index d08c575..4858ce6 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,14 +23,17 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char		*dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -39,8 +42,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfo(buf, "dir");
 		for (i = 0; i < xlrec->ntablespaces; i++)
-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf, " %s", dbpath1);
+			pfree(dbpath1);
+		}
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5658971..ea6661e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7890,6 +7890,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 14efbf3..1417707 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -56,6 +56,136 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (reachedConsistency)
+		elog(PANIC, "cannot find directory %s tablespace %u database %u",
+			 path, spcNode, dbNode);
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+		elog(DEBUG2, "missing directory %s tablespace %u database %u already exists: %s",
+			 path, spcNode, dbNode, entry->path);
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		elog(DEBUG2, "logged missing dir %s tablespace %u database %u",
+			 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+		elog(DEBUG2, "dir %s tablespace %u database %u is not missing",
+			 path, spcNode, dbNode);
+	else
+		elog(DEBUG2, "forgot missing dir %s for tablespace %u database %u",
+			 path, spcNode, dbNode);
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index da0e5d8..b7cbb88 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2185,7 +2186,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2203,6 +2206,54 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2260,6 +2311,9 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id, dst_path);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 570dcb2..e4f6aad 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1516,6 +1517,8 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0572b24..938a35f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index d1e91a2..9d321e7 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,6 +19,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"
 
+extern void CheckMissingDirs4DbaseRedo(void);
+
 extern Oid	createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok, bool force);
 extern void DropDatabase(ParseState *pstate, DropdbStmt *stmt);
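
As a reading aid for the 0003 patch above, the log/forget/check bookkeeping it describes can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions: it uses a fixed-size array instead of PostgreSQL's dynahash, and the names log_missing_dir, forget_missing_dir and count_unresolved (and the two size constants) are hypothetical, not the functions added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_MISSING 16			/* illustrative capacity, not from the patch */
#define PATH_LEN	64			/* illustrative, stands in for MAXPGPATH */

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef struct MissingDir
{
	bool		used;
	Oid			spc;			/* tablespace OID, must be valid */
	Oid			db;				/* InvalidOid: the tablespace dir itself */
	char		path[PATH_LEN];
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Remember a directory that create-database replay could not find. */
bool
log_missing_dir(Oid spc, Oid db, const char *path)
{
	int			free_slot = -1;

	assert(spc != InvalidOid);

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return true;		/* already tracked, nothing to add */
		if (!missing[i].used && free_slot < 0)
			free_slot = i;
	}

	if (free_slot < 0)
		return false;			/* table full (sketch-only limitation) */

	missing[free_slot].used = true;
	missing[free_slot].spc = spc;
	missing[free_slot].db = db;
	snprintf(missing[free_slot].path, PATH_LEN, "%s", path);
	return true;
}

/* A drop record was replayed, so the directory is legitimately gone. */
bool
forget_missing_dir(Oid spc, Oid db)
{
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
		{
			missing[i].used = false;
			return true;
		}
	}
	return false;				/* was never recorded as missing */
}

/* Entries still unresolved; the patch PANICs at consistency if any remain. */
int
count_unresolved(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used)
			n++;
	return n;
}
```

In this sketch, replaying a create database record against a missing tablespace would call log_missing_dir(ts, InvalidOid, path), a later drop tablespace record would call forget_missing_dir(ts, InvalidOid), and recovery would be allowed to reach consistency only when count_unresolved() returns 0. Note that an entry is removed only by a drop record with exactly the same (tablespace, database) key.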

#33

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Anastasia Lubennikova (#32)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

I looked at this for a little while and was bothered by the Perl changes; it
seems out of place to have RecursiveCopy be thinking about tablespaces,
which is way out of its league. So I rewrote that to use a callback:
the PostgresNode code passes a callback that's in charge of handling the
case of a symlink. Things look much more in place with that. I didn't
verify that all places that should use this are covered.
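
For readers who do not follow the Perl, the callback arrangement described above can be sketched in C. This is not the code in the patch: it walks a toy in-memory tree rather than a real file system, and every name in it (Node, symlink_fn, copy_tree, count_symlink) is hypothetical. The point it illustrates is only that the generic copier knows nothing about tablespaces; it either hands a symlink to the caller-supplied callback or gives up.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
	NODE_FILE,
	NODE_DIR,
	NODE_SYMLINK
} NodeKind;

typedef struct Node
{
	NodeKind	kind;
	const struct Node *children;	/* array of entries, used for NODE_DIR */
	size_t		nchildren;
} Node;

/* Caller-supplied handler for symlinks; returns 1 on success, 0 on failure. */
typedef int (*symlink_fn) (const Node *link, void *arg);

/*
 * Generic recursive "copy": counts copied files in *copied.  Returns 1 on
 * success, 0 if a symlink is found and no callback was given (or the
 * callback itself failed).
 */
int
copy_tree(const Node *src, symlink_fn on_symlink, void *arg, int *copied)
{
	assert(src != NULL && copied != NULL);

	switch (src->kind)
	{
		case NODE_SYMLINK:
			if (on_symlink == NULL)
				return 0;		/* analogous to "Cannot operate on symlink" */
			return on_symlink(src, arg);
		case NODE_FILE:
			(*copied)++;
			return 1;
		case NODE_DIR:
			for (size_t i = 0; i < src->nchildren; i++)
			{
				if (!copy_tree(&src->children[i], on_symlink, arg, copied))
					return 0;
			}
			return 1;
	}
	return 0;
}

/* Example callback: just counts the symlinks it was asked to handle. */
int
count_symlink(const Node *link, void *arg)
{
	(void) link;
	(*(int *) arg)++;
	return 1;
}
```

Passing NULL as the callback reproduces the old behaviour (fail on the first symlink), while a caller that knows about tablespaces passes its own handler, which is the division of labour the rewritten RecursiveCopy/PostgresNode code aims for.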

In 0002 I found adding a new function unnecessary: we can keep backwards
compat by checking 'ref' of the third argument. With that we don't have
to add a new function. (POD changes pending.)

I haven't reviewed 0003.

v8 of all these patches attached.

"git am" told me your 0001 was in unrecognized format. It applied fine
with "patch". I suggest that if you're going to submit a series with
commit messages and all, please use "git format-patch" with the same
"-v" argument (9 in this case) for all patches.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v8-0001-Support-node-initialization-from-backup-with-tabl.patch (text/x-diff; charset=us-ascii)

From a81e747f0bfa90af8021e2399e196e177a93f62c Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v8 1/3] Support node initialization from backup with
 tablespaces

User-defined tablespaces appear as symlinks in the backup.  This
commit tweaks recursive copy subroutine to allow for symlinks specific
to tablespaces.

Authored by Kyotaro
---
 src/test/perl/PostgresNode.pm  | 29 +++++++++++++++++++++++-
 src/test/perl/RecursiveCopy.pm | 40 ++++++++++++++++++++++++++++------
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 2e0cf4a2f3..3cae483ddb 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -593,6 +593,32 @@ sub backup_fs_cold
 	return;
 }
 
+sub _srcsymlink
+{
+	my ($srcpath, $destpath) = @_;
+
+	croak "Cannot operate on symlink \"$srcpath\""
+		if ($srcpath !~ qr{/(pg_tblspc/[0-9]+)$});
+
+	# We have mapped tablespaces. Copy them individually
+	my $tmpdir = TestLib::tempdir;
+	my $dstrealdir = TestLib::perl2host($tmpdir);
+	my $srcrealdir = readlink($srcpath);
+
+	opendir(my $dh, $srcrealdir);
+	while (readdir $dh)
+	{
+		next if (/^\.\.?$/);
+		my $spath = "$srcrealdir/$_";
+		my $dpath = "$dstrealdir/$_";
+		RecursiveCopy::copypath($spath, $dpath);
+	}
+	closedir $dh;
+
+	symlink $dstrealdir, $destpath;
+
+	return 1;
+}
 
 # Common sub of backup_fs_hot and backup_fs_cold
 sub _backup_fs
@@ -680,7 +706,8 @@ sub init_from_backup
 
 	my $data_path = $self->data_dir;
 	rmdir($data_path);
-	RecursiveCopy::copypath($backup_path, $data_path);
+	RecursiveCopy::copypath($backup_path, $data_path,
+							srcsymlinkfn => \&_srcsymlink);
 	chmod(0700, $data_path);
 
 	# Base configuration for this node
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..715edcdedd 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -66,6 +66,7 @@ sub copypath
 {
 	my ($base_src_dir, $base_dest_dir, %params) = @_;
 	my $filterfn;
+	my $srcsymlinkfn;
 
 	if (defined $params{filterfn})
 	{
@@ -80,31 +81,55 @@ sub copypath
 		$filterfn = sub { return 1; };
 	}
 
+	if (defined $params{srcsymlinkfn})
+	{
+		croak "if specified, srcsymlinkfn must be a subroutine reference"
+			unless defined(ref $params{srcsymlinkfn})
+			and (ref $params{srcsymlinkfn} eq 'CODE');
+
+		$srcsymlinkfn = $params{srcsymlinkfn};
+	}
+	else
+	{
+		$srcsymlinkfn = undef;
+	}
+
 	# Complain if original path is bogus, because _copypath_recurse won't.
 	croak "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
 
 	# Start recursive copy from current directory
-	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
+	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn, $srcsymlinkfn);
 }
 
 # Recursive private guts of copypath
 sub _copypath_recurse
 {
-	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn) = @_;
+	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn,
+		$srcsymlinkfn) = @_;
 	my $srcpath  = "$base_src_dir/$curr_path";
 	my $destpath = "$base_dest_dir/$curr_path";
 
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# If caller provided us with a callback, call it; otherwise we're out.
+	if (-l $srcpath)
+	{
+		if (defined $srcsymlinkfn)
+		{
+			return &$srcsymlinkfn($srcpath, $destpath);
+		}
+		else
+		{
+			croak "Cannot operate on symlink \"$srcpath\"";
+		}
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
@@ -137,7 +162,8 @@ sub _copypath_recurse
 		{
 			next if ($entry eq '.' or $entry eq '..');
 			_copypath_recurse($base_src_dir, $base_dest_dir,
-				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
+				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn,
+				$srcsymlinkfn)
 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
 		}
 
-- 
2.20.1

v8-0002-Tests-to-replay-create-database-operation-on-stan.patch (text/x-diff; charset=us-ascii)

From 32e2c106aee30202f5731a163a6e2f1a88a6d06b Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v8 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

Second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra, Anastasia, Kyotaro, Paul and me.
---
 src/test/perl/PostgresNode.pm             |  34 ++++-
 src/test/recovery/t/011_crash_recovery.pl | 152 +++++++++++++++++++++-
 2 files changed, 178 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 3cae483ddb..e6e7ea505d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
@@ -1640,13 +1649,24 @@ Returns 1 if successful, 0 if timed out.
 
 sub poll_query_until
 {
-	my ($self, $dbname, $query, $expected) = @_;
+	my ($self, $dbname, $query, $params) = @_;
+	my $expected;
 
-	$expected = 't' unless defined($expected);    # default value
+	# Be backwards-compatible
+	if (defined $params and ref $params eq '')
+	{
+		$params = {
+			expected => $params,
+			timeout => 180
+		};
+	}
+
+	$params->{expected} = 't' unless defined($params->{expected});
+	$params->{timeout} = 180 unless defined($params->{timeout});
 
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params->{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -1656,7 +1676,7 @@ sub poll_query_until
 		chomp($stdout);
 		$stdout =~ s/\r//g if $TestLib::windows_os;
 
-		if ($stdout eq $expected)
+		if ($stdout eq $params->{expected})
 		{
 			return 1;
 		}
@@ -1674,7 +1694,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params->{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..013d3d5b0c 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,6 +6,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 if ($Config{osname} eq 'MSWin32')
 {
@@ -15,7 +16,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +67,152 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    tablespace.
+#
+# 2. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 3. Create a database from another database as template then drop the
+#    template database.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master = TestLib::tempdir;
+$dropme_ts_master = TestLib::perl2host($dropme_ts_master);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts location '$dropme_ts_master';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby = TestLib::tempdir;
+$dropme_ts_standby = TestLib::perl2host($dropme_ts_standby);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = "$dropme_ts_master=$dropme_ts_standby," .
+  "$source_ts_master=$source_ts_standby," .
+  "$target_ts_master=$target_ts_standby";
+$node_master->backup($backup_name, tablespace_mappings => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  DROP DATABASE dropme_db1;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE dropme_db2;
+						  DROP TABLESPACE dropme_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$ts_master=$ts_standby");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		{ timeout => 5 }) == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.20.1

v8-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-diff; charset=us-ascii)

From 47ee0330541f22cfd934c59cc1ae6df34b08eea6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v8 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to track
missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul.

Authored by Paul, Kyotaro and Asim R P.
---
 src/backend/access/rmgrdesc/dbasedesc.c |  16 ++-
 src/backend/access/transam/xlog.c       |   6 ++
 src/backend/access/transam/xlogutils.c  | 130 ++++++++++++++++++++++++
 src/backend/commands/dbcommands.c       |  54 ++++++++++
 src/backend/commands/tablespace.c       |   3 +
 src/include/access/xlogutils.h          |   4 +
 src/include/commands/dbcommands.h       |   2 +
 7 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 73d2a4ca34..f7117873d7 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,14 +23,17 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char		*dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -39,8 +42,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfo(buf, "dir");
 		for (i = 0; i < xlrec->ntablespaces; i++)
-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf, " %s", dbpath1);
+			pfree(dbpath1);
+		}
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..d97e48f369 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7890,6 +7890,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b55c383370..6c2dd5aba1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -56,6 +56,136 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (reachedConsistency)
+		elog(PANIC, "cannot find directory %s tablespace %d database %d",
+			 path, spcNode, dbNode);
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+		elog(DEBUG2, "missing directory %s tablespace %d database %d already exists: %s",
+			 path, spcNode, dbNode, entry->path);
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		elog(DEBUG2, "logged missing dir %s tablespace %d database %d",
+			 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+		elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
+			 path, spcNode, dbNode);
+	else
+		elog(DEBUG2, "forgot missing dir %s for tablespace %d database %d",
+			 path, spcNode, dbNode);
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 367c30adb0..6d6668e4f8 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2185,7 +2186,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2203,6 +2206,54 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2260,6 +2311,9 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id, dst_path);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 051478057f..33407dceeb 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1516,6 +1517,8 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..4106735006 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index f8f6d5ffd0..b71b400e70 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,6 +19,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"
 
+extern void CheckMissingDirs4DbaseRedo(void);
+
 extern Oid	createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok, bool force);
 extern void DropDatabase(ParseState *pstate, DropdbStmt *stmt);
-- 
2.20.1

#34

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Alvaro Herrera (#33)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2020-Jan-09, Alvaro Herrera wrote:

> I looked at this a little while and was bothered by the perl changes; it
> seems out of place to have RecursiveCopy be thinking about tablespaces,
> which is way out of its league. So I rewrote that to use a callback:
> the PostgresNode code passes a callback that's in charge to handle the
> case of a symlink. Things look much more in place with that. I didn't
> verify that all places that should use this are filled.
>
> In 0002 I found adding a new function unnecessary: we can keep backwards
> compat by checking 'ref' of the third argument. With that we don't have
> to add a new function. (POD changes pending.)

I forgot to add that something in these changes is broken (probably the
symlink handling callback) so the tests fail, but I couldn't stay away
from my daughter's birthday long enough to figure out what or how. I'm
on something else today, so if one of you can research and submit fixed
versions, that'd be great.

Thanks,

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#35

Paul Guo

pguo@pivotal.io

almost 6 years ago

In reply to: Alvaro Herrera (#34)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Jan 10, 2020 at 9:43 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

> On 2020-Jan-09, Alvaro Herrera wrote:
>
> > I looked at this a little while and was bothered by the perl changes; it
> > seems out of place to have RecursiveCopy be thinking about tablespaces,
> > which is way out of its league. So I rewrote that to use a callback:
> > the PostgresNode code passes a callback that's in charge to handle the
> > case of a symlink. Things look much more in place with that. I didn't
> > verify that all places that should use this are filled.
> >
> > In 0002 I found adding a new function unnecessary: we can keep backwards
> > compat by checking 'ref' of the third argument. With that we don't have
> > to add a new function. (POD changes pending.)
>
> I forgot to add that something in these changes is broken (probably the
> symlink handling callback) so the tests fail, but I couldn't stay away
> from my daughter's birthday long enough to figure out what or how. I'm
> on something else today, so if one of you can research and submit fixed
> versions, that'd be great.
>
> Thanks,

I spent some time on this before getting off work today.

With the fix below, the 4th test now passes, but the 5th (the last one)
hangs due to a PANIC.

(gdb) bt
#0 0x0000003397e32625 in raise () from /lib64/libc.so.6
#1 0x0000003397e33e05 in abort () from /lib64/libc.so.6
#2 0x0000000000a90506 in errfinish (dummy=0) at elog.c:590
#3 0x0000000000a92b4b in elog_finish (elevel=22, fmt=0xb2d580 "cannot find directory %s tablespace %d database %d") at elog.c:1465
#4 0x000000000057aa0a in XLogLogMissingDir (spcNode=16384, dbNode=0, path=0x1885100 "pg_tblspc/16384/PG_13_202001091/16389") at xlogutils.c:104
#5 0x000000000065e92e in dbase_redo (record=0x1841568) at dbcommands.c:2225
#6 0x000000000056ac94 in StartupXLOG () at xlog.c:7200

diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index b71b400e700..f8f6d5ffd03 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,8 +19,6 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"

-extern void CheckMissingDirs4DbaseRedo(void);
-
 extern Oid createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok, bool force);
 extern void DropDatabase(ParseState *pstate, DropdbStmt *stmt);
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e6e7ea505d9..4eef8bb1985 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -615,11 +615,11 @@ sub _srcsymlink
    my $srcrealdir = readlink($srcpath);

    opendir(my $dh, $srcrealdir);
-   while (readdir $dh)
+   while (my $entry = (readdir $dh))
    {
-       next if (/^\.\.?$/);
-       my $spath = "$srcrealdir/$_";
-       my $dpath = "$dstrealdir/$_";
+       next if ($entry eq '.' or $entry eq '..');
+       my $spath = "$srcrealdir/$entry";
+       my $dpath = "$dstrealdir/$entry";
        RecursiveCopy::copypath($spath, $dpath);
    }
    closedir $dh;

#36

Paul Guo

pguo@pivotal.io

almost 6 years ago

In reply to: Paul Guo (#35)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

I have also fixed the last test failure (it was due to a small bug in the
test, not in the code). Attached is the new patch series. Let's see the
CI pipeline result.
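
For readers following the series, the bookkeeping that 0003 adds can be
sketched in standalone C. This is only an illustration under stated
assumptions, not the patch's code: the names missing_dir_log,
missing_dir_forget and missing_dir_check are hypothetical stand-ins for
XLogLogMissingDir, XLogForgetMissingDir and XLogCheckMissingDirs, a fixed
array replaces the dynahash table, and boolean or integer return values
replace the PANIC calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

#define MAX_MISSING_DIRS 100	/* assumption: small fixed capacity */
#define MAX_PATH_LEN 1024		/* assumption: stands in for MAXPGPATH */

typedef struct MissingDir
{
	bool		used;
	Oid			spcNode;		/* tablespace OID, always valid */
	Oid			dbNode;			/* InvalidOid means "the tablespace itself" */
	char		path[MAX_PATH_LEN];
} MissingDir;

static MissingDir missing_dirs[MAX_MISSING_DIRS];
static bool reached_consistency = false;

/* Linear lookup by (tablespace, database) key; the patch uses a hash table. */
static MissingDir *
find_missing_dir(Oid spcNode, Oid dbNode)
{
	for (int i = 0; i < MAX_MISSING_DIRS; i++)
	{
		if (missing_dirs[i].used &&
			missing_dirs[i].spcNode == spcNode &&
			missing_dirs[i].dbNode == dbNode)
			return &missing_dirs[i];
	}
	return NULL;
}

/*
 * Remember a directory that create-database replay could not find.
 * Returns false where the patch would PANIC (consistency already reached)
 * or when the sketch's fixed table is full.
 */
bool
missing_dir_log(Oid spcNode, Oid dbNode, const char *path)
{
	assert(spcNode != InvalidOid);

	if (reached_consistency)
		return false;
	if (find_missing_dir(spcNode, dbNode) != NULL)
		return true;			/* already tracked */

	for (int i = 0; i < MAX_MISSING_DIRS; i++)
	{
		if (!missing_dirs[i].used)
		{
			missing_dirs[i].used = true;
			missing_dirs[i].spcNode = spcNode;
			missing_dirs[i].dbNode = dbNode;
			snprintf(missing_dirs[i].path, sizeof(missing_dirs[i].path),
					 "%s", path);
			return true;
		}
	}
	return false;
}

/* Called when a matching drop record is replayed; true if an entry was removed. */
bool
missing_dir_forget(Oid spcNode, Oid dbNode)
{
	MissingDir *entry = find_missing_dir(spcNode, dbNode);

	if (entry == NULL)
		return false;
	entry->used = false;
	return true;
}

/*
 * Report unresolved entries and return how many remain.  Only when none
 * remain does the sketch mark consistency as reached; the patch would PANIC
 * on a non-zero count instead.
 */
int
missing_dir_check(void)
{
	int			unresolved = 0;

	for (int i = 0; i < MAX_MISSING_DIRS; i++)
	{
		if (missing_dirs[i].used)
		{
			fprintf(stderr, "missing directory \"%s\" tablespace %u database %u\n",
					missing_dirs[i].path, missing_dirs[i].spcNode,
					missing_dirs[i].dbNode);
			unresolved++;
		}
	}
	if (unresolved == 0)
		reached_consistency = true;
	return unresolved;
}
```

In terms of the failing scenario: replaying the create database record
with the tablespace directory gone would log the key (tablespace OID,
InvalidOid), the later drop tablespace record would forget that same key,
and the check before reaching consistency would then find nothing
unresolved. A log attempt after that point is refused, which corresponds
to the PANIC in the backtrace in my previous mail.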

Attachments:

v9-0001-Support-node-initialization-from-backup-with-tabl.patch (application/octet-stream)

From fa8b7c964d170bbf7aeb1a7b0e94de28f651f388 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v9 1/3] Support node initialization from backup with
 tablespaces

User-defined tablespaces appear as symlinks in the backup.  This
commit tweaks the recursive copy subroutine to allow for symlinks
specific to tablespaces.

Authored by Kyotaro
---
 src/test/perl/PostgresNode.pm  | 29 ++++++++++++++++++++++++++++-
 src/test/perl/RecursiveCopy.pm | 40 +++++++++++++++++++++++++++++++++-------
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 2e0cf4a2f3e..587a12ef405 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -593,6 +593,32 @@ sub backup_fs_cold
 	return;
 }
 
+sub _srcsymlink
+{
+	my ($srcpath, $destpath) = @_;
+
+	croak "Cannot operate on symlink \"$srcpath\""
+		if ($srcpath !~ qr{/(pg_tblspc/[0-9]+)$});
+
+	# We have mapped tablespaces. Copy them individually
+	my $tmpdir = TestLib::tempdir;
+	my $dstrealdir = TestLib::perl2host($tmpdir);
+	my $srcrealdir = readlink($srcpath);
+
+	opendir(my $dh, $srcrealdir);
+	while (my $entry = (readdir $dh))
+	{
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
+		RecursiveCopy::copypath($spath, $dpath);
+	}
+	closedir $dh;
+
+	symlink $dstrealdir, $destpath;
+
+	return 1;
+}
 
 # Common sub of backup_fs_hot and backup_fs_cold
 sub _backup_fs
@@ -680,7 +706,8 @@ sub init_from_backup
 
 	my $data_path = $self->data_dir;
 	rmdir($data_path);
-	RecursiveCopy::copypath($backup_path, $data_path);
+	RecursiveCopy::copypath($backup_path, $data_path,
+							srcsymlinkfn => \&_srcsymlink);
 	chmod(0700, $data_path);
 
 	# Base configuration for this node
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63b..715edcdedd0 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -66,6 +66,7 @@ sub copypath
 {
 	my ($base_src_dir, $base_dest_dir, %params) = @_;
 	my $filterfn;
+	my $srcsymlinkfn;
 
 	if (defined $params{filterfn})
 	{
@@ -80,31 +81,55 @@ sub copypath
 		$filterfn = sub { return 1; };
 	}
 
+	if (defined $params{srcsymlinkfn})
+	{
+		croak "if specified, srcsymlinkfn must be a subroutine reference"
+			unless defined(ref $params{srcsymlinkfn})
+			and (ref $params{srcsymlinkfn} eq 'CODE');
+
+		$srcsymlinkfn = $params{srcsymlinkfn};
+	}
+	else
+	{
+		$srcsymlinkfn = undef;
+	}
+
 	# Complain if original path is bogus, because _copypath_recurse won't.
 	croak "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
 
 	# Start recursive copy from current directory
-	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
+	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn, $srcsymlinkfn);
 }
 
 # Recursive private guts of copypath
 sub _copypath_recurse
 {
-	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn) = @_;
+	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn,
+		$srcsymlinkfn) = @_;
 	my $srcpath  = "$base_src_dir/$curr_path";
 	my $destpath = "$base_dest_dir/$curr_path";
 
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# If caller provided us with a callback, call it; otherwise we're out.
+	if (-l $srcpath)
+	{
+		if (defined $srcsymlinkfn)
+		{
+			return &$srcsymlinkfn($srcpath, $destpath);
+		}
+		else
+		{
+			croak "Cannot operate on symlink \"$srcpath\"";
+		}
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
@@ -137,7 +162,8 @@ sub _copypath_recurse
 		{
 			next if ($entry eq '.' or $entry eq '..');
 			_copypath_recurse($base_src_dir, $base_dest_dir,
-				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
+				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn,
+				$srcsymlinkfn)
 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
 		}
 
-- 
2.14.3

v9-0003-Fix-replay-of-create-database-records-on-standby.patch (application/octet-stream)

From 57c4d4f31e501e5b25bb36919e666855fd2b0789 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v9 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/rmgrdesc/dbasedesc.c |  16 ++--
 src/backend/access/transam/xlog.c       |   6 ++
 src/backend/access/transam/xlogutils.c  | 130 ++++++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c       |  54 +++++++++++++
 src/backend/commands/tablespace.c       |   3 +
 src/include/access/xlogutils.h          |   4 +
 6 files changed, 208 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 73d2a4ca34b..f7117873d73 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,14 +23,17 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char		*dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -39,8 +42,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfo(buf, "dir");
 		for (i = 0; i < xlrec->ntablespaces; i++)
-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf, " %s", dbpath1);
+			pfree(dbpath1);
+		}
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0eb..d97e48f369c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7890,6 +7890,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b55c3833703..6c2dd5aba1b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -56,6 +56,136 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (reachedConsistency)
+		elog(PANIC, "cannot find directory %s tablespace %d database %d",
+			 path, spcNode, dbNode);
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+		elog(DEBUG2, "missing directory %s tablespace %d database %d already exists: %s",
+			 path, spcNode, dbNode, entry->path);
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		elog(DEBUG2, "logged missing dir %s tablespace %d database %d",
+			 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+		elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
+			 path, spcNode, dbNode);
+	else
+		elog(DEBUG2, "forgot missing dir %s for tablespace %d database %d",
+			 path, spcNode, dbNode);
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 367c30adb01..6d6668e4f88 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2185,7 +2186,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2203,6 +2206,54 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2260,6 +2311,9 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id, dst_path);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 051478057f6..33407dceeb2 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1516,6 +1517,8 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d96..41067350069 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.14.3

v9-0002-Tests-to-replay-create-database-operation-on-stan.patch (application/octet-stream)

From f1c9874eb75a9a2979e56402b4a5a363a6f061bc Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v9 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of the underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightly aborts replay
during archive recovery if a missing directory is encountered when
replaying a create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/perl/PostgresNode.pm             |  34 +++++--
 src/test/recovery/t/011_crash_recovery.pl | 152 +++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 587a12ef405..4eef8bb1985 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
@@ -1640,13 +1649,24 @@ Returns 1 if successful, 0 if timed out.
 
 sub poll_query_until
 {
-	my ($self, $dbname, $query, $expected) = @_;
+	my ($self, $dbname, $query, $params) = @_;
+	my $expected;
+
+	# Be backwards-compatible
+	if (defined $params and ref $params eq '')
+	{
+		$params = {
+			expected => $params,
+			timeout => 180
+		};
+	}
 
-	$expected = 't' unless defined($expected);    # default value
+	$params->{expected} = 't' unless defined($params->{expected});
+	$params->{timeout} = 180 unless defined($params->{timeout});
 
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params->{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -1656,7 +1676,7 @@ sub poll_query_until
 		chomp($stdout);
 		$stdout =~ s/\r//g if $TestLib::windows_os;
 
-		if ($stdout eq $expected)
+		if ($stdout eq $params->{expected})
 		{
 			return 1;
 		}
@@ -1674,7 +1694,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params->{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb5..3690865e07b 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,6 +6,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 if ($Config{osname} eq 'MSWin32')
 {
@@ -15,7 +16,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +67,152 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    tablespace.
+#
+# 2. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 3. Create a database from another database as template then drop the
+#    template database.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master = TestLib::tempdir;
+$dropme_ts_master = TestLib::perl2host($dropme_ts_master);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts location '$dropme_ts_master';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby = TestLib::tempdir;
+$dropme_ts_standby = TestLib::perl2host($dropme_ts_standby);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = "$dropme_ts_master=$dropme_ts_standby," .
+  "$source_ts_master=$source_ts_standby," .
+  "$target_ts_master=$target_ts_standby";
+$node_master->backup($backup_name, tablespace_mappings => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  DROP DATABASE dropme_db1;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE dropme_db2;
+						  DROP TABLESPACE dropme_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$ts_master=$ts_standby");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 180 seconds');
+}
-- 
2.14.3

#37

Fujii Masao

masao.fujii@oss.nttdata.com

almost 6 years ago

In reply to: Paul Guo (#36)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2020/01/15 19:18, Paul Guo wrote:

> I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

Thanks for updating the patches!

I started reading the 0003 patch.

The approach that the 0003 patch uses is not a perfect solution.
If the standby crashes after tblspc_redo() removes the directory and before
its subsequent COMMIT record is replayed, a PANIC error would occur, since
there can be some unresolved missing-directory entries when we reach the
consistent state. The problem would happen very rarely, though...
Just an idea: would calling XLogFlush() to update the minimum recovery point
just before tblspc_redo() performs destroy_tablespace_directories() be
safe and helpful for this problem?

-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);

If the patch is a bug fix and will be back-ported, the above change
would alter pg_waldump's output for CREATE/DROP DATABASE between
minor versions. IMO it's better to avoid such a change and split the above
into a separate patch only for master.

-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf,  "%s", dbpath1);
+			pfree(dbpath1);
+		}

Same as above.

BTW, the above "%s" should be " %s", i.e., a space character needs to be
prepended to "%s".
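
To illustrate the separator point, here is a standalone sketch. It is not
the patch's code: describe_drop() is a hypothetical helper, and the "dir"
prefix and the paths are only assumed for the example. Each path is appended
with a leading space so that entries do not run together.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a description such as "dir <path> <path>".
 * Output is truncated safely if the buffer is too small.
 */
static void
describe_drop(char *buf, size_t len, const char *const *paths, int npaths)
{
    assert(buf != NULL && len > 0);

    snprintf(buf, len, "dir");
    for (int i = 0; i < npaths; i++)
    {
        size_t used = strlen(buf);

        /* The leading space in " %s" separates this entry from the text before it. */
        snprintf(buf + used, len - used, " %s", paths[i]);
    }
}
```

With "%s" instead of " %s", the first path would be glued directly onto
"dir" and consecutive paths would be glued to each other.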

+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);

The third argument of XLogLogMissingDir() should be parent_path instead of
dst_path?

+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+		elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
+			 path, spcNode, dbNode);

I think that this elog() is useless and rather confusing.

+ XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");

The third argument should be set to the actual path instead of an empty
string. Otherwise XLogForgetMissingDir() may emit a confusing DEBUG2
message. Or the third argument of XLogForgetMissingDir() should be removed
and the path in the DEBUG2 message should be calculated from the spcNode
and dbNode in the hash entry in XLogForgetMissingDir().
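
A toy model of the second option might look like the following. It is only a
sketch under assumptions: a fixed array stands in for the real hash table,
all function names are hypothetical, and the path layout is simplified (the
real layout also contains a version-specific directory). The point is that a
printable path can be derived from the (spcNode, dbNode) key alone, so the
forget function needs no path argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;           /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)
#define MAX_MISSING_DIRS 16         /* arbitrary capacity for this sketch */

typedef struct
{
    Oid  spcNode;
    Oid  dbNode;
    bool used;
} MissingDirEntry;

static MissingDirEntry missing_dirs[MAX_MISSING_DIRS];

/* Clear the table, as would happen at the start of recovery. */
static void
reset_missing_dirs(void)
{
    memset(missing_dirs, 0, sizeof(missing_dirs));
}

static MissingDirEntry *
find_entry(Oid spcNode, Oid dbNode)
{
    for (int i = 0; i < MAX_MISSING_DIRS; i++)
        if (missing_dirs[i].used &&
            missing_dirs[i].spcNode == spcNode &&
            missing_dirs[i].dbNode == dbNode)
            return &missing_dirs[i];
    return NULL;
}

/* Record a missing directory; returns false only if the table is full. */
static bool
log_missing_dir(Oid spcNode, Oid dbNode)
{
    if (find_entry(spcNode, dbNode) != NULL)
        return true;            /* already recorded */
    for (int i = 0; i < MAX_MISSING_DIRS; i++)
    {
        if (!missing_dirs[i].used)
        {
            missing_dirs[i] = (MissingDirEntry) {spcNode, dbNode, true};
            return true;
        }
    }
    return false;
}

/* Remove the entry for this exact key; returns true if one was present. */
static bool
forget_missing_dir(Oid spcNode, Oid dbNode)
{
    MissingDirEntry *entry = find_entry(spcNode, dbNode);

    if (entry == NULL)
        return false;
    entry->used = false;
    return true;
}

/* Number of unresolved entries; nonzero at consistency would mean trouble. */
static int
count_missing_dirs(void)
{
    int n = 0;

    for (int i = 0; i < MAX_MISSING_DIRS; i++)
        if (missing_dirs[i].used)
            n++;
    return n;
}

/* Derive a printable path from the key alone (simplified layout). */
static void
missing_dir_path(Oid spcNode, Oid dbNode, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (dbNode == InvalidOid)
        snprintf(buf, len, "pg_tblspc/%u", spcNode);
    else
        snprintf(buf, len, "pg_tblspc/%u/%u", spcNode, dbNode);
}
```

In this model a tablespace-level entry uses InvalidOid as the database part
of the key, mirroring the calls quoted above.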

+#include "common/file_perm.h"

This seems not necessary.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#38

Fujii Masao

masao.fujii@oss.nttdata.com

almost 6 years ago

In reply to: Fujii Masao (#37)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2020/01/28 0:24, Fujii Masao wrote:

> On 2020/01/15 19:18, Paul Guo wrote:

>> I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

> Thanks for updating the patches!

> I started reading the 0003 patch.

I marked this patch as Waiting on Author in CF because there has been no
update since my last review comments. Could you mark it as Needs Review again
when you post the updated version of the patch?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#39

Daniel Gustafsson

daniel@yesql.se

over 5 years ago

In reply to: Fujii Masao (#38)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 25 Mar 2020, at 06:52, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/01/28 0:24, Fujii Masao wrote:

>> On 2020/01/15 19:18, Paul Guo wrote:

>>> I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

>> Thanks for updating the patches!
>> I started reading the 0003 patch.

> I marked this patch as Waiting on Author in CF because there is no update
> since my last review comments. Could you mark it as Needs Review again
> if you post the updated version of the patch.

This thread has been stalled since effectively January, so I'm marking this
patch Returned with Feedback. Feel free to open a new entry once the review
comments have been addressed.

cheers ./daniel

#40

Paul Guo

guopa@vmware.com

over 5 years ago

In reply to: Alvaro Herrera (#33)

4 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Looks like my previous reply was held for moderation (maybe due to my new email address).
I configured my pg account today using the new email address. I guess this email would be
held for moderation.

I'm now replying to my previous email and attaching the new patch series.

On Jul 6, 2020, at 10:18 AM, Paul Guo <guopa@vmware.com> wrote:

Thanks for the review. I'm now picking the work up again. I modified the code following the comments.
Besides, I tweaked the test code a bit. There are several things I'm not 100% sure about. Please see
my replies below.

On Jan 27, 2020, at 11:24 PM, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/01/15 19:18, Paul Guo wrote:
>> I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

> Thanks for updating the patches!

> I started reading the 0003 patch.

Yes, this looks like an issue. My understanding is the scenario below.

XLogLogMissingDir()

XLogFlush() in redo (e.g. in a commit redo). <- creates a minimum recovery point (call it LSN_A).

tblspc_redo()->XLogForgetMissingDir()
<- Suppose we crash immediately after removing the directory in tblspc_redo().
<- When we replay again during crash recovery, we check consistency at LSN_A and thus PANIC in XLogCheckMissingDirs().

commit

We should add an XLogFlush() in tblspc_redo(). This also brings several other questions to mind.

1. Should we also call XLogFlush() in dbase_redo() for XLOG_DBASE_DROP?
It calls both XLogDropDatabase() and XLogForgetMissingDir(), which seem to have the same issue.

2. xact_redo_abort() also calls DropRelationFiles(). Why don't we call XLogFlush() there?
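
The crash window described above can be modelled with a toy replay loop. This
is a sketch under assumptions, not PostgreSQL code: records are indexed by
position instead of LSN, a single flag stands in for the missing-directory
table, and all names (RecType, DiskState, replay) are hypothetical. The
flush_before_drop switch models adding an XLogFlush() before the directory is
removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    REC_CREATE_DB,          /* needs the tablespace directory */
    REC_COMMIT_FLUSH,       /* commit redo that advances the min recovery point */
    REC_DROP_TABLESPACE     /* removes the tablespace directory */
} RecType;

/* State that survives a crash. */
typedef struct
{
    bool dir_exists;        /* is the tablespace directory on disk? */
    int  min_recovery;      /* record index at which consistency is reached */
} DiskState;

/*
 * Replay records[0..stop_after] from the beginning.  Returns false where
 * this model would PANIC: consistency has been reached while a missing
 * directory is still unresolved.
 */
static bool
replay(const RecType *records, int stop_after, DiskState *disk,
       bool flush_before_drop)
{
    bool missing = false;   /* in-memory only, lost on crash */

    assert(records != NULL && disk != NULL);

    for (int i = 0; i <= stop_after; i++)
    {
        switch (records[i])
        {
            case REC_CREATE_DB:
                if (!disk->dir_exists)
                    missing = true;     /* log the missing directory */
                break;
            case REC_COMMIT_FLUSH:
                if (i > disk->min_recovery)
                    disk->min_recovery = i;
                break;
            case REC_DROP_TABLESPACE:
                if (flush_before_drop && i > disk->min_recovery)
                    disk->min_recovery = i;
                disk->dir_exists = false;
                missing = false;        /* forget the missing directory */
                break;
        }

        /* Consistency check once the minimum recovery point is replayed. */
        if (i >= disk->min_recovery && missing)
            return false;
    }
    return true;
}
```

Replaying CREATE_DB, COMMIT_FLUSH, DROP_TABLESPACE and stopping right after
the drop (the crash), then replaying again from the start, fails in this
model without the extra flush and succeeds with it.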

> - appendStringInfo(buf, "copy dir %u/%u to %u/%u",
> -  xlrec->src_tablespace_id, xlrec->src_db_id,
> -  xlrec->tablespace_id, xlrec->db_id);
> + dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
> + dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
> + appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
> + pfree(dbpath2);
> + pfree(dbpath1);

I know we do not want WAL format changes between minor releases, but does a change to the WAL
description string between minor releases affect users? Anyway, I'll extract this part into a separate
patch in the series, since this change is actually independent of the other changes.

> - appendStringInfo(buf, " %u/%u",
> -  xlrec->tablespace_ids[i], xlrec->db_id);
> + {
> + dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
> + appendStringInfo(buf,  "%s", dbpath1);
> + pfree(dbpath1);
> + }

> Same as above.

> BTW, the above "%s" should be " %s", i.e., a space character needs to be
> appended to the head of "%s".

> + get_parent_directory(parent_path);
> + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> + {
> + XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);

> The third argument of XLogLogMissingDir() should be parent_path instead of
> dst_path?

The argument is for debug message printing so both should be fine, but admittedly we are
logging for the tablespace directory so parent_path might be better.

> + if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
> + elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
> +  path, spcNode, dbNode);

> I think that this elog() is useless and rather confusing.

OK. Modified.

> + XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");

I'm now removing the third argument, and using GetDatabasePath() to get the path if the database OID is not InvalidOid.

> +#include "common/file_perm.h"

> This seems not necessary.

Right.

Attachments:

v10-0001-Support-node-initialization-from-backup-with-tab.patch (application/octet-stream)

From b29747de93fb70d3a7a6f843e232d3dec747451e Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v10 1/4] Support node initialization from backup with
 tablespaces

User defined tablespaces appear as symlinks in the backup.  This
commit tweaks recursive copy subroutine to allow for symlinks specific
to tablespaces.

Authored by Kyotaro Horiguchi
---
 src/test/perl/PostgresNode.pm  | 29 ++++++++++++++++++++++++++++-
 src/test/perl/RecursiveCopy.pm | 40 +++++++++++++++++++++++++++++++++-------
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1407359aef6..e1ddb36ff1d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -593,6 +593,32 @@ sub backup_fs_cold
 	return;
 }
 
+sub _srcsymlink
+{
+	my ($srcpath, $destpath) = @_;
+
+	croak "Cannot operate on symlink \"$srcpath\""
+		if ($srcpath !~ qr{/(pg_tblspc/[0-9]+)$});
+
+	# We have mapped tablespaces. Copy them individually
+	my $tmpdir = TestLib::tempdir;
+	my $dstrealdir = TestLib::perl2host($tmpdir);
+	my $srcrealdir = readlink($srcpath);
+
+	opendir(my $dh, $srcrealdir);
+	while (my $entry = (readdir $dh))
+	{
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
+		RecursiveCopy::copypath($spath, $dpath);
+	}
+	closedir $dh;
+
+	symlink $dstrealdir, $destpath;
+
+	return 1;
+}
 
 # Common sub of backup_fs_hot and backup_fs_cold
 sub _backup_fs
@@ -684,7 +710,8 @@ sub init_from_backup
 
 	my $data_path = $self->data_dir;
 	rmdir($data_path);
-	RecursiveCopy::copypath($backup_path, $data_path);
+	RecursiveCopy::copypath($backup_path, $data_path,
+							srcsymlinkfn => \&_srcsymlink);
 	chmod(0700, $data_path);
 
 	# Base configuration for this node
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63b..715edcdedd0 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -66,6 +66,7 @@ sub copypath
 {
 	my ($base_src_dir, $base_dest_dir, %params) = @_;
 	my $filterfn;
+	my $srcsymlinkfn;
 
 	if (defined $params{filterfn})
 	{
@@ -80,31 +81,55 @@ sub copypath
 		$filterfn = sub { return 1; };
 	}
 
+	if (defined $params{srcsymlinkfn})
+	{
+		croak "if specified, srcsymlinkfn must be a subroutine reference"
+			unless defined(ref $params{srcsymlinkfn})
+			and (ref $params{srcsymlinkfn} eq 'CODE');
+
+		$srcsymlinkfn = $params{srcsymlinkfn};
+	}
+	else
+	{
+		$srcsymlinkfn = undef;
+	}
+
 	# Complain if original path is bogus, because _copypath_recurse won't.
 	croak "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
 
 	# Start recursive copy from current directory
-	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
+	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn, $srcsymlinkfn);
 }
 
 # Recursive private guts of copypath
 sub _copypath_recurse
 {
-	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn) = @_;
+	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn,
+		$srcsymlinkfn) = @_;
 	my $srcpath  = "$base_src_dir/$curr_path";
 	my $destpath = "$base_dest_dir/$curr_path";
 
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# If caller provided us with a callback, call it; otherwise we're out.
+	if (-l $srcpath)
+	{
+		if (defined $srcsymlinkfn)
+		{
+			return &$srcsymlinkfn($srcpath, $destpath);
+		}
+		else
+		{
+			croak "Cannot operate on symlink \"$srcpath\"";
+		}
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
@@ -137,7 +162,8 @@ sub _copypath_recurse
 		{
 			next if ($entry eq '.' or $entry eq '..');
 			_copypath_recurse($base_src_dir, $base_dest_dir,
-				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
+				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn,
+				$srcsymlinkfn)
 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
 		}
 
-- 
2.14.3

v10-0002-Tests-to-replay-create-database-operation-on-sta.patch (application/octet-stream)

From 00a3a6e028afeca2495e55634cba53f0c813b241 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v10 2/4] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

Second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/perl/PostgresNode.pm             |  34 +++++--
 src/test/recovery/t/011_crash_recovery.pl | 162 +++++++++++++++++++++++++++++-
 2 files changed, 188 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e1ddb36ff1d..81538bbbebe 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -546,13 +546,22 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @rest = ();
+
+	if (defined $params{tablespace_mappings})
+	{
+		my @ts_mappings = split(/,/, $params{tablespace_mappings});
+		foreach my $elem (@ts_mappings) {
+			push(@rest, '--tablespace-mapping='.$elem);
+		}
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @rest);
 	print "# Backup finished\n";
 	return;
 }
@@ -1666,13 +1675,24 @@ Returns 1 if successful, 0 if timed out.
 
 sub poll_query_until
 {
-	my ($self, $dbname, $query, $expected) = @_;
+	my ($self, $dbname, $query, $params) = @_;
+	my $expected;
+
+	# Be backwards-compatible
+	if (defined $params and ref $params eq '')
+	{
+		$params = {
+			expected => $params,
+			timeout => 180
+		};
+	}
 
-	$expected = 't' unless defined($expected);    # default value
+	$params->{expected} = 't' unless defined($params->{expected});
+	$params->{timeout} = 180 unless defined($params->{timeout});
 
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params->{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -1682,7 +1702,7 @@ sub poll_query_until
 		chomp($stdout);
 		$stdout =~ s/\r//g if $TestLib::windows_os;
 
-		if ($stdout eq $expected)
+		if ($stdout eq $params->{expected})
 		{
 			return 1;
 		}
@@ -1700,7 +1720,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params->{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index ca6e92b50df..55f96519fa0 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,6 +6,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 if ($Config{osname} eq 'MSWin32')
 {
@@ -15,7 +16,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +67,162 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    database.
+#
+# 2. Create a database against a user-defined tablespace then drop the
+#    database and the tablespace.
+#
+# 3. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 4. Create a database from another database as template then drop the
+#    template database.
+#
+#
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master1 = TestLib::tempdir;
+$dropme_ts_master1 = TestLib::perl2host($dropme_ts_master1);
+my $dropme_ts_master2 = TestLib::tempdir;
+$dropme_ts_master2 = TestLib::perl2host($dropme_ts_master2);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts1 location '$dropme_ts_master1';
+						   CREATE TABLESPACE dropme_ts2 location '$dropme_ts_master2';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby1 = TestLib::tempdir;
+$dropme_ts_standby1 = TestLib::perl2host($dropme_ts_standby1);
+my $dropme_ts_standby2 = TestLib::tempdir;
+$dropme_ts_standby2 = TestLib::perl2host($dropme_ts_standby2);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = "$dropme_ts_master1=$dropme_ts_standby1," .
+  "$dropme_ts_master2=$dropme_ts_standby2," .
+  "$source_ts_master=$source_ts_standby," .
+  "$target_ts_master=$target_ts_standby";
+$node_master->backup($backup_name, tablespace_mappings => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 tablespace_mappings =>
+					   "$ts_master=$ts_standby");
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgreNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.14.3

v10-0003-Fix-replay-of-create-database-records-on-standby.patch (application/octet-stream)

From 26b385bdcc42fe25c933a17ed68d636f57271138 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v10 3/4] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 ++
 src/backend/access/transam/xlogutils.c | 155 +++++++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  53 +++++++++++
 src/backend/commands/tablespace.c      |   5 ++
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 223 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd93bcfaeba..af4dd19a71d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8032,6 +8032,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 322b0e8ff5b..bd98e42c895 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -59,6 +59,161 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (reachedConsistency)
+	{
+		if (dbNode == InvalidOid)
+			elog(PANIC, "cannot find directory %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(PANIC, "cannot find directory %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1c..4a3adc7c6fc 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2185,7 +2185,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2203,6 +2205,54 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else
+		{
+			/*
+			 * It is possible that drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		{
+			XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2260,6 +2310,9 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 051478057f6..5fd36f93197 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1516,6 +1517,10 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9f..34eecfab791 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.14.3

v10-0004-Fix-database-create-drop-wal-description.patch (application/octet-stream)

From 7a002cd379e13794edae53aa926898e33445475d Mon Sep 17 00:00:00 2001
From: Paul Guo <guopa@vmware.com>
Date: Mon, 6 Jul 2020 21:20:15 +0800
Subject: [PATCH v10 4/4] Fix database create/drop wal description.

Previously the description messages were wrong because the database path is
not simply tablespace_oid/database_oid. Now we call GetDatabasePath() to get
the correct database path.
---
 src/backend/access/rmgrdesc/dbasedesc.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index d82484b9db4..8312ef8bd36 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,14 +23,17 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char		*dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -39,8 +42,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfo(buf, "dir");
 		for (i = 0; i < xlrec->ntablespaces; i++)
-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf,  " %s", dbpath1);
+			pfree(dbpath1);
+		}
 	}
 }
 
-- 
2.14.3

#41

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 5 years ago

In reply to: Paul Guo (#40)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Wed, 8 Jul 2020 12:56:44 +0000, Paul Guo <guopa@vmware.com> wrote in

On 2020/01/15 19:18, Paul Guo wrote:
I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

Thanks for updating the patches!

I started reading the 0003 patch.

The approach that the 0003 patch uses is not the perfect solution.
If the standby crashes after tblspc_redo() removes the directory and before
its subsequent COMMIT record is replayed, PANIC error would occur since
there can be some unresolved missing directory entries when we reach the
consistent state. The problem would very rarely happen, though...
Just an idea: would calling XLogFlush() to update the minimum recovery point
just before tblspc_redo() performs destroy_tablespace_directories() be
safe and helpful for this problem?

It seems to me that what the current patch does is too complex. What
we need to do here is to remember every invalid operation and then
forget it when the prerequisite object is dropped.

When a tablespace is dropped before consistency is established, we
don't need to care what has been performed inside the tablespace. From
this perspective, it is enough to remember the tablespace id when we
fail to do something inside a tablespace because it is absent, and
then forget that id when we remove the tablespace. We could remember
individual database ids to show them in error messages, but I'm not
sure that's useful. The reason log_invalid_page records block numbers
is to allow the machinery to handle partial table truncations, but
that is not the case here, since dropping a tablespace cannot leave
some of its databases behind.

As a result, we won't see any unresolved invalid operations in a
dropped tablespace.

Am I missing something?
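
The tablespace-only bookkeeping described above can be sketched in standalone C. This is a toy model and not PostgreSQL code: a fixed-size array stands in for a dynahash table, and the names missing_spc, remember_missing_tablespace, forget_missing_tablespace and unresolved_missing_tablespaces are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in PostgreSQL headers. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)
#define MAX_MISSING 16

/* Tablespace OIDs whose directory was found missing during replay. */
static Oid missing_spc[MAX_MISSING];
static size_t n_missing = 0;

/* Remember a missing tablespace.  Repeated failures collapse to one entry. */
bool
remember_missing_tablespace(Oid spc)
{
    assert(spc != InvalidOid);

    for (size_t i = 0; i < n_missing; i++)
        if (missing_spc[i] == spc)
            return true;

    if (n_missing == MAX_MISSING)
        return false;           /* toy table is full */

    missing_spc[n_missing++] = spc;
    return true;
}

/* Replaying DROP TABLESPACE resolves every failure inside that tablespace. */
void
forget_missing_tablespace(Oid spc)
{
    for (size_t i = 0; i < n_missing; i++)
    {
        if (missing_spc[i] == spc)
        {
            /* Overwrite the slot with the last entry and shrink the table. */
            missing_spc[i] = missing_spc[--n_missing];
            return;
        }
    }
}

/* Entries still present at the consistency point are unresolved. */
size_t
unresolved_missing_tablespaces(void)
{
    return n_missing;
}
```

With this shape, any number of failed operations inside one tablespace leave a single entry, and one DROP TABLESPACE record clears it; a nonzero count at the consistency point is what would trigger the PANIC.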

dbase_redo:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
+        XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);

This means "record the belonging table space directory if it is not
found OR it is not a directory". The former can be valid, but the
latter can never be (I don't think we need to bother considering
symlinks there).
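
The distinction drawn above, between a path that is absent and a path that exists but is not a directory, can be sketched with plain POSIX stat(). This is only an illustration: dir_state and classify_dir are hypothetical names, and the patch itself folds both cases into a single condition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>

typedef enum
{
    DIR_PRESENT,            /* path exists and is a directory */
    DIR_MISSING,            /* path does not exist (ENOENT) */
    DIR_NOT_A_DIRECTORY,    /* path exists but is something else */
    DIR_STAT_ERROR          /* stat() failed for any other reason */
} dir_state;

/* Classify a path the way the redo-time check might want to see it. */
dir_state
classify_dir(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return (errno == ENOENT) ? DIR_MISSING : DIR_STAT_ERROR;

    return S_ISDIR(st.st_mode) ? DIR_PRESENT : DIR_NOT_A_DIRECTORY;
}
```

Under this split, only DIR_MISSING would be a candidate for "skip and remember"; the other non-present states could be treated as errors immediately.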

+    /*
+     * Source directory may be missing.  E.g. the template database used
+     * for creating this database may have been dropped, due to reasons
+     * noted above.  Moving a database from one tablespace may also be a
+     * partner in the crime.
+     */
+    if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+    {
+      XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);

This is a part of the *creation* of the target directory. The lack of
the source directory cannot be valid even if the source directory is
dropped afterwards in the WAL stream, but we can allow it if the
*target* tablespace is dropped afterwards. As a result, as I
mentioned above, we don't need to record anything about the database
directory.

By the way, the name XLogLogMiss.. is somewhat confusing. How about
XLogReportMissingDir (named after report_invalid_page)?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#42

Paul Guo

guopa@vmware.com

almost 5 years ago

In reply to: Kyotaro Horiguchi (#41)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thanks for the review, please see the replies below.

On Jan 5, 2021, at 9:07 AM, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Wed, 8 Jul 2020 12:56:44 +0000, Paul Guo <guopa@vmware.com> wrote in

On 2020/01/15 19:18, Paul Guo wrote:
I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

Thanks for updating the patches!

I started reading the 0003 patch.

The approach that the 0003 patch uses is not the perfect solution.
If the standby crashes after tblspc_redo() removes the directory and before
its subsequent COMMIT record is replayed, PANIC error would occur since
there can be some unresolved missing directory entries when we reach the
consistent state. The problem would very rarely happen, though...
Just an idea: would calling XLogFlush() to update the minimum recovery point
just before tblspc_redo() performs destroy_tablespace_directories() be
safe and helpful for this problem?

It seems to me that what the current patch does is too complex. What
we need to do here is to remember every invalid operation and then
forget it when the prerequisite object is dropped.

When a tablespace is dropped before consistency is established, we
don't need to care what has been performed inside the tablespace. From
this perspective, it is enough to remember the tablespace id when we
fail to do something inside a tablespace because it is absent, and
then forget that id when we remove the tablespace. We could remember
individual database ids to show them in error messages, but I'm not
sure that's useful. The reason log_invalid_page records block numbers
is to allow the machinery to handle partial table truncations, but
that is not the case here, since dropping a tablespace cannot leave
some of its databases behind.

As a result, we won't see any unresolved invalid operations in a
dropped tablespace.

Am I missing something?

Yes, removing the database id from the hash key in the log/forget code should
usually be fine, but the previous code does stricter/safer checking.

Consider the scenario:

CREATE DATABASE newdb1 TEMPLATE template_db1;
CREATE DATABASE newdb2 TEMPLATE template_db2; <- in case the template_db2 database directory is missing abnormally somehow.
DROP DATABASE template_db1;

The previous code could detect this, but if we remove the database id from the key,
this bad scenario goes undetected.
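
The scenario above can be modeled with a table keyed on (tablespace, database). This is a standalone sketch rather than the patch's code: the array replaces the hash table, and dir_key, report_missing_dir, forget_missing_dir and unresolved_missing_dirs are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in PostgreSQL headers. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)
#define MAX_DIRS 16

typedef struct
{
    Oid spc;    /* tablespace OID, always valid */
    Oid db;     /* database OID, or InvalidOid for a tablespace directory */
} dir_key;

static dir_key missing_dirs[MAX_DIRS];
static size_t n_missing_dirs = 0;

/* Return the index of (spc, db), or n_missing_dirs if it is not tracked. */
static size_t
find_missing_dir(Oid spc, Oid db)
{
    size_t i;

    for (i = 0; i < n_missing_dirs; i++)
        if (missing_dirs[i].spc == spc && missing_dirs[i].db == db)
            break;
    return i;
}

/* Track a missing directory; a repeated report is a no-op. */
bool
report_missing_dir(Oid spc, Oid db)
{
    assert(spc != InvalidOid);

    if (find_missing_dir(spc, db) < n_missing_dirs)
        return true;
    if (n_missing_dirs == MAX_DIRS)
        return false;           /* toy table is full */

    missing_dirs[n_missing_dirs].spc = spc;
    missing_dirs[n_missing_dirs].db = db;
    n_missing_dirs++;
    return true;
}

/* A drop record resolves exactly the entry with the same key. */
void
forget_missing_dir(Oid spc, Oid db)
{
    size_t i = find_missing_dir(spc, db);

    if (i < n_missing_dirs)
        missing_dirs[i] = missing_dirs[--n_missing_dirs];
}

size_t
unresolved_missing_dirs(void)
{
    return n_missing_dirs;
}
```

Taking 16384 and 16385 as made-up OIDs for template_db1 and template_db2 in tablespace 1663, reporting both and then forgetting only (1663, 16384) leaves one entry behind, which is the case the consistency check is meant to catch.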

dbase_redo:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
+        XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
This means "record the belonging table space directory if it is not
found OR it is not a directory". The former can be valid, but the
latter can never be (I don't think we need to bother considering
symlinks there).

Again, this is a safer check: if parent_path is somehow a file, for example,
we should eventually panic and let the user investigate and then redo the recovery.

+    /*
+     * Source directory may be missing.  E.g. the template database used
+     * for creating this database may have been dropped, due to reasons
+     * noted above.  Moving a database from one tablespace may also be a
+     * partner in the crime.
+     */
+    if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+    {
+      XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
This is a part of the *creation* of the target directory. The lack of
the source directory cannot be valid even if the source directory is
dropped afterwards in the WAL stream, but we can allow it if the
*target* tablespace is dropped afterwards. As a result, as I
mentioned above, we don't need to record anything about the database
directory.

By the way, the name XLogLogMiss.. is somewhat confusing. How about
XLogReportMissingDir (named after report_invalid_page)?

Agree with you.

Also, your comments remind me that we should skip the check once the consistency
point has been reached.
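
As a rough sketch of that gating, the decision for a create database record can be written as a small pure function. The enum and function names here are hypothetical, and the third outcome assumes that a missing directory after consistency is simply left to the ordinary replay path, which is expected to raise an error such as the "could not create directory" failure shown at the start of this thread.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    REPLAY_NORMALLY,        /* directories exist: replay the record */
    SKIP_AND_TRACK,         /* missing before consistency: remember and skip */
    FALL_THROUGH_TO_ERROR   /* missing after consistency: no special handling */
} create_db_replay_action;

/* Decide how replay of a create database record should proceed. */
create_db_replay_action
decide_create_db_replay(bool dirs_present, bool reached_consistency)
{
    if (dirs_present)
        return REPLAY_NORMALLY;
    if (!reached_consistency)
        return SKIP_AND_TRACK;
    return FALL_THROUGH_TO_ERROR;
}
```

The point of the gate is that the missing-directory table is only consulted while crash recovery is still catching up; afterwards a missing directory is treated like any other replay failure.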

Here is a git diff against the previous patch. I’ll send out the new rebased patches after
the consensus is reached.

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7ade385965..c8fe3fe228 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -90,7 +90,7 @@ typedef struct xl_missing_dir
 static HTAB *missing_dir_tab = NULL;

 void
-XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
 {
 	xl_missing_dir_key key;
 	bool found;
@@ -103,16 +103,6 @@ XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
 	 */
 	Assert(OidIsValid(spcNode));

-	if (reachedConsistency)
-	{
-		if (dbNode == InvalidOid)
-			elog(PANIC, "cannot find directory %s (tablespace %d)",
-				 path, spcNode);
-		else
-			elog(PANIC, "cannot find directory %s (tablespace %d database %d)",
-				 path, spcNode, dbNode);
-	}
-
 	if (missing_dir_tab == NULL)
 	{
 		/* create hash table when first needed */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index fbff422c3b..7bd6d4efd9 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2205,7 +2205,7 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
-		else
+		else if (!reachedConsistency)
 		{
 			/*
 			 * It is possible that drop tablespace record appearing later in
@@ -2221,7 +2221,7 @@ dbase_redo(XLogReaderState *record)
 			get_parent_directory(parent_path);
 			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
 			{
-				XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
 				skip = true;
 				ereport(WARNING,
 						(errmsg("skipping create database WAL record"),
@@ -2239,9 +2239,10 @@ dbase_redo(XLogReaderState *record)
 		 * noted above.  Moving a database from one tablespace may also be a
 		 * partner in the crime.
 		 */
-		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
 		{
-			XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
 			skip = true;
 			ereport(WARNING,
 					(errmsg("skipping create database WAL record"),
@@ -2311,7 +2312,8 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));

-			XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);

 			pfree(dst_path);
 		}
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 294c9676b4..15eaa757cc 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1534,7 +1534,8 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

-		XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);

 		XLogFlush(record->EndRecPtr);

diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index da561af5ab..6561d9cebe 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,7 +23,7 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);

-extern void XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
 extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
 extern void XLogCheckMissingDirs(void);

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 748200ebb5..95eb6d26cc 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -141,7 +141,7 @@ $node_master->wait_for_catchup($node_standby, 'replay',
 $node_standby->safe_psql('postgres', 'CHECKPOINT');

 # Do immediate shutdown just after a sequence of CREAT DATABASE / DROP
-# DATABASE / DROP TABLESPACE. This causes CREATE DATBASE WAL records
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records

#43

Alvaro Herrera

alvherre@2ndquadrant.com

almost 5 years ago

In reply to: Paul Guo (#42)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2021-Jan-27, Paul Guo wrote:

Here is a git diff against the previous patch. I’ll send out the new
rebased patches after the consensus is reached.

Hmm, can you post a rebased set, where the points under discussion
are marked in XXX comments explaining what the issue is? This thread is
so long and so old that it's pretty hard to navigate the whole thing in
order to find out exactly what is being questioned.

I think 0004 can be pushed without further ado, since it's a clear and
simple fix. 0001 needs a comment about the new parameter in
RecursiveCopy's POD documentation.

As I understand, this is a backpatchable bug-fix.

--
Álvaro Herrera 39°49'30"S 73°17'W

#44

Paul Guo

guopa@vmware.com

almost 5 years ago

In reply to: Alvaro Herrera (#43)

4 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2021/3/27, 10:23 PM, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:

Hmm, can you post a rebased set, where the points under discussion
are marked in XXX comments explaining what the issue is? This thread is
so long and so old that it's pretty hard to navigate the whole thing in
order to find out exactly what is being questioned.

OK. Attached is the rebased version, which includes the change I discussed
in my previous reply. I also added the POD documentation change for RecursiveCopy,
and modified the patch to use the backup_options introduced in
081876d75ea15c3bd2ee5ba64a794fd8ea46d794 for tablespace mapping.

I think 0004 can be pushed without further ado, since it's a clear and
simple fix. 0001 needs a comment about the new parameter in
RecursiveCopy's POD documentation.

Yeah, 0004 carries no risk. One concern seemed to be compatibility with some
WAL dump/analysis tools(?). I have no idea about this. But if we do not backport
0004, we do not need to worry about it.

As I understand, this is a backpatchable bug-fix.

Yes.

Thanks.

Attachments:

v11-0001-Support-node-initialization-from-backup-with-tab.patch (application/octet-stream)

From 4075252c30fd5728913ef78b99f6c6cc70875ecd Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v11 1/4] Support node initialization from backup with
 tablespaces

User defined tablespaces appear as symlinks in the backup.  This
commit tweaks recursive copy subroutine to allow for symlinks specific
to tablespaces.

Authored by Kyotaro Horiguchi
---
 src/test/perl/PostgresNode.pm  | 29 ++++++++++++++++++++++++++-
 src/test/perl/RecursiveCopy.pm | 45 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1e96357d7e..e27120e975 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -607,6 +607,32 @@ sub backup_fs_cold
 	return;
 }
 
+sub _srcsymlink
+{
+	my ($srcpath, $destpath) = @_;
+
+	croak "Cannot operate on symlink \"$srcpath\""
+		if ($srcpath !~ qr{/(pg_tblspc/[0-9]+)$});
+
+	# We have mapped tablespaces. Copy them individually
+	my $tmpdir = TestLib::tempdir;
+	my $dstrealdir = TestLib::perl2host($tmpdir);
+	my $srcrealdir = readlink($srcpath);
+
+	opendir(my $dh, $srcrealdir);
+	while (my $entry = (readdir $dh))
+	{
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
+		RecursiveCopy::copypath($spath, $dpath);
+	}
+	closedir $dh;
+
+	symlink $dstrealdir, $destpath;
+
+	return 1;
+}
 
 # Common sub of backup_fs_hot and backup_fs_cold
 sub _backup_fs
@@ -715,7 +741,8 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		RecursiveCopy::copypath($backup_path, $data_path);
+		RecursiveCopy::copypath($backup_path, $data_path,
+								srcsymlinkfn => \&_srcsymlink);
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..aaa3ce8a1b 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -47,6 +47,11 @@ This subroutine will be called for each entry in the source directory with its
 relative path as only parameter; if the subroutine returns true the entry is
 copied, otherwise the file is skipped.
 
+If the B<srcsymlinkfn> parameter is given, it must be a subroutine reference.
+This subroutine will be called when the source directory is a symlink. It
+determines the mechanism that copies files from the source directory to the
+destination directory.
+
 On failure the target directory may be in some incomplete state; no cleanup is
 attempted.
 
@@ -66,6 +71,7 @@ sub copypath
 {
 	my ($base_src_dir, $base_dest_dir, %params) = @_;
 	my $filterfn;
+	my $srcsymlinkfn;
 
 	if (defined $params{filterfn})
 	{
@@ -80,31 +86,55 @@ sub copypath
 		$filterfn = sub { return 1; };
 	}
 
+	if (defined $params{srcsymlinkfn})
+	{
+		croak "if specified, srcsymlinkfn must be a subroutine reference"
+			unless defined(ref $params{srcsymlinkfn})
+			and (ref $params{srcsymlinkfn} eq 'CODE');
+
+		$srcsymlinkfn = $params{srcsymlinkfn};
+	}
+	else
+	{
+		$srcsymlinkfn = undef;
+	}
+
 	# Complain if original path is bogus, because _copypath_recurse won't.
 	croak "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
 
 	# Start recursive copy from current directory
-	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
+	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn, $srcsymlinkfn);
 }
 
 # Recursive private guts of copypath
 sub _copypath_recurse
 {
-	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn) = @_;
+	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn,
+		$srcsymlinkfn) = @_;
 	my $srcpath  = "$base_src_dir/$curr_path";
 	my $destpath = "$base_dest_dir/$curr_path";
 
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# If caller provided us with a callback, call it; otherwise we're out.
+	if (-l $srcpath)
+	{
+		if (defined $srcsymlinkfn)
+		{
+			return &$srcsymlinkfn($srcpath, $destpath);
+		}
+		else
+		{
+			croak "Cannot operate on symlink \"$srcpath\"";
+		}
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
@@ -137,7 +167,8 @@ sub _copypath_recurse
 		{
 			next if ($entry eq '.' or $entry eq '..');
 			_copypath_recurse($base_src_dir, $base_dest_dir,
-				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
+				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn,
+				$srcsymlinkfn)
 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
 		}
 
-- 
2.14.3

v11-0002-Tests-to-replay-create-database-operation-on-sta.patch (application/octet-stream)

From 154497b47de2de423b1ecf420d45abf2af0e1139 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v11 2/4] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

Second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/perl/PostgresNode.pm             |  21 +++-
 src/test/recovery/t/011_crash_recovery.pl | 162 +++++++++++++++++++++++++++++-
 2 files changed, 177 insertions(+), 6 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e27120e975..d4e17644b4 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1888,15 +1888,26 @@ Returns 1 if successful, 0 if timed out.
 
 sub poll_query_until
 {
-	my ($self, $dbname, $query, $expected) = @_;
+	my ($self, $dbname, $query, $params) = @_;
+	my $expected;
+
+	# Be backwards-compatible
+	if (defined $params and ref $params eq '')
+	{
+		$params = {
+			expected => $params,
+			timeout => 180
+		};
+	}
 
 	local %ENV = $self->_get_env();
 
-	$expected = 't' unless defined($expected);    # default value
+	$params->{expected} = 't' unless defined($params->{expected});
+	$params->{timeout} = 180 unless defined($params->{timeout});
 
 	my $cmd = [ 'psql', '-XAt', '-c', $query, '-d', $self->connstr($dbname) ];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params->{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -1906,7 +1917,7 @@ sub poll_query_until
 		$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
 		chomp($stdout);
 
-		if ($stdout eq $expected)
+		if ($stdout eq $params->{expected})
 		{
 			return 1;
 		}
@@ -1924,7 +1935,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params->{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 10cd98f70a..faae3fb5e7 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -6,9 +6,10 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 
-plan tests => 3;
+plan tests => 5;
 
 my $node = get_new_node('primary');
 $node->init(allows_streaming => 1);
@@ -59,3 +60,162 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish; # wait for psql to quit gracefully
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    database.
+#
+# 2. Create a database against a user-defined tablespace then drop the
+#    database and the tablespace.
+#
+# 3. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 4. Create a database from another database as template then drop the
+#    template database.
+#
+#
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master1 = TestLib::tempdir;
+$dropme_ts_master1 = TestLib::perl2host($dropme_ts_master1);
+my $dropme_ts_master2 = TestLib::tempdir;
+$dropme_ts_master2 = TestLib::perl2host($dropme_ts_master2);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts1 location '$dropme_ts_master1';
+						   CREATE TABLESPACE dropme_ts2 location '$dropme_ts_master2';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby1 = TestLib::tempdir;
+$dropme_ts_standby1 = TestLib::perl2host($dropme_ts_standby1);
+my $dropme_ts_standby2 = TestLib::tempdir;
+$dropme_ts_standby2 = TestLib::perl2host($dropme_ts_standby2);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = [ "--tablespace-mapping=$dropme_ts_master1=$dropme_ts_standby1",
+  "--tablespace-mapping=$dropme_ts_master2=$dropme_ts_standby2",
+  "--tablespace-mapping=$source_ts_master=$source_ts_standby",
+  "--tablespace-mapping=$target_ts_master=$target_ts_standby" ];
+$node_master->backup($backup_name, backup_options => $ts_mapping);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = get_new_node('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 backup_options =>
+					   [ "--tablespace-mapping=$ts_master=$ts_standby" ]);
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.14.3

v11-0003-Fix-replay-of-create-database-records-on-standby.patch (application/octet-stream)

From 16bf763dda617d6aaf454f8aff3c4d08e09b198b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v11 3/4] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to track
missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 ++
 src/backend/access/transam/xlogutils.c | 145 +++++++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  55 +++++++++++++
 src/backend/commands/tablespace.c      |   6 ++
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 216 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..b525679e8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8175,6 +8175,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..8341b7a242 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -59,6 +59,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..7bd6d4efd9 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2185,7 +2185,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2203,6 +2205,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2260,6 +2311,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 69ea155d50..15eaa757cc 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1533,6 +1534,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 9ac602b674..6561d9cebe 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.14.3
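
As a rough illustration of the bookkeeping that 0003 describes (not the patch's actual code), the report/forget/check cycle can be sketched in standalone C. A fixed-size array stands in for the dynahash table, and the names report_missing_dir, forget_missing_dir and count_unresolved_dirs are hypothetical. Only the (tablespace OID, database OID) key and the rule that unresolved entries are fatal at the consistency point are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and InvalidOid. */
typedef unsigned int Oid;
#define InvalidOid 0u

#define MAX_MISSING 64      /* assumed capacity, for illustration only */
#define PATH_LEN    256

typedef struct MissingDir
{
    Oid  spc;               /* tablespace OID, always valid */
    Oid  db;                /* InvalidOid means "the tablespace directory itself" */
    char path[PATH_LEN];
    bool used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Look up an entry by its (tablespace, database) key. */
static MissingDir *
find_missing(Oid spc, Oid db)
{
    for (size_t i = 0; i < MAX_MISSING; i++)
        if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
            return &missing[i];
    return NULL;
}

/* Remember a directory that was missing during replay; duplicates are ignored. */
bool
report_missing_dir(Oid spc, Oid db, const char *path)
{
    assert(spc != InvalidOid);

    if (find_missing(spc, db) != NULL)
        return true;        /* already tracked */

    for (size_t i = 0; i < MAX_MISSING; i++)
    {
        if (!missing[i].used)
        {
            missing[i].used = true;
            missing[i].spc = spc;
            missing[i].db = db;
            snprintf(missing[i].path, PATH_LEN, "%s", path);
            return true;
        }
    }
    return false;           /* table full */
}

/* A later drop record resolves the reference. */
void
forget_missing_dir(Oid spc, Oid db)
{
    MissingDir *entry = find_missing(spc, db);

    if (entry != NULL)
        entry->used = false;
}

/* Number of references still unresolved; non-zero would be fatal. */
int
count_unresolved_dirs(void)
{
    int n = 0;

    for (size_t i = 0; i < MAX_MISSING; i++)
        if (missing[i].used)
            n++;
    return n;
}
```

In this sketch, replay of a drop tablespace record would call forget_missing_dir(spc, InvalidOid) and replay of a drop database record would call forget_missing_dir(spc, db). A non-zero count at the consistency point corresponds to the PANIC raised by XLogCheckMissingDirs() in the patch.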

v11-0004-Fix-database-create-drop-wal-description.patch (application/octet-stream)

From 1939f82fc67d218b6990b8a694709f0a7cb59430 Mon Sep 17 00:00:00 2001
From: Paul Guo <guopa@vmware.com>
Date: Mon, 6 Jul 2020 21:20:15 +0800
Subject: [PATCH v11 4/4] Fix database create/drop wal description.

Previously the description messages were wrong, since the database path is not
simply tablespace_oid/database_oid. Now we call GetDatabasePath() to get the
correct database path.

Authored by Paul Guo
---
 src/backend/access/rmgrdesc/dbasedesc.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 26609845aa..873737161c 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,14 +23,17 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char		*dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -39,8 +42,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfoString(buf, "dir");
 		for (i = 0; i < xlrec->ntablespaces; i++)
-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf,  " %s", dbpath1);
+			pfree(dbpath1);
+		}
 	}
 }
 
-- 
2.14.3
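
To illustrate why the old "%u/%u" description could mislead, here is a minimal standalone sketch of the on-disk layout as I understand it: "global" for pg_global (OID 1664), "base/<db>" for pg_default (OID 1663), and "pg_tblspc/<spc>/<version dir>/<db>" for user-defined tablespaces. The function database_path and its version_dir parameter are hypothetical. The real GetDatabasePath() takes only the two OIDs and relies on the compiled-in version directory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and the built-in tablespace OIDs. */
typedef unsigned int Oid;
#define DEFAULT_TABLESPACE 1663u    /* pg_default */
#define GLOBAL_TABLESPACE  1664u    /* pg_global */

/*
 * Write the data-directory-relative path of a database into buf.
 * version_dir is the per-version subdirectory, e.g. "PG_12_201904072".
 * Returns the snprintf() result.
 */
int
database_path(char *buf, size_t len, Oid db, Oid spc, const char *version_dir)
{
    assert(buf != NULL && version_dir != NULL);

    if (spc == GLOBAL_TABLESPACE)
        return snprintf(buf, len, "global");
    if (spc == DEFAULT_TABLESPACE)
        return snprintf(buf, len, "base/%u", db);
    return snprintf(buf, len, "pg_tblspc/%u/%s/%u", spc, version_dir, db);
}
```

With the values from the log at the top of this thread, the record described as "copy dir 1663/1 to 65546/65547" would, under this sketch, read "copy dir base/1 to pg_tblspc/65546/PG_12_201904072/65547", which matches the directory named in the FATAL message.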

#45

Ibrar Ahmed

ibrar.ahmad@gmail.com

over 4 years ago

In reply to: Paul Guo (#44)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Tue, Mar 30, 2021 at 12:12 PM Paul Guo <guopa@vmware.com> wrote:

On 2021/3/27, 10:23 PM, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:

Hmm, can you post a rebased set, where the points under discussion
are marked in XXX comments explaining what the issue is? This thread is
so long and old that it's pretty hard to navigate the whole thing in
order to find out exactly what is being questioned.

OK. Attached is the rebased version that includes the change I discussed
in my previous reply. I also added a POD documentation change for RecursiveCopy,
and modified the patch to use the backup_options introduced in
081876d75ea15c3bd2ee5ba64a794fd8ea46d794 for tablespace mapping.

I think 0004 can be pushed without further ado, since it's a clear and
simple fix. 0001 needs a comment about the new parameter in
RecursiveCopy's POD documentation.

Yeah, 0004 is not risky at all. One concern seemed to be compatibility with some
WAL dump/analysis tools(?). I have no idea about this. But if we do not backport
0004, we do not seem to need to worry about this.

As I understand, this is a backpatchable bug-fix.

Yes.

Thanks.

The patch does not apply successfully:

http://cfbot.cputube.org/patch_33_2161.log

Can you please rebase the patch?

--
Ibrar Ahmed

#46

Paul Guo

guopa@vmware.com

over 4 years ago

In reply to: Ibrar Ahmed (#45)

4 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Rebased.

Attachments:

v12-0004-Fix-database-create-drop-wal-description.patch (application/octet-stream)

From 7c03826e7fb9872ea293f95d8270b3e64433949a Mon Sep 17 00:00:00 2001
From: Paul Guo <guopa@vmware.com>
Date: Mon, 6 Jul 2020 21:20:15 +0800
Subject: [PATCH v12 4/4] Fix database create/drop wal description.

Previously the description messages were wrong, since the database path is not
simply tablespace_oid/database_oid. Now we call GetDatabasePath() to get the
correct database path.

Authored by Paul Guo
---
 src/backend/access/rmgrdesc/dbasedesc.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 26609845aa..873737161c 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -23,14 +23,17 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	char		*dbpath1, *dbpath2;
 
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
-						 xlrec->tablespace_id, xlrec->db_id);
+		dbpath1 = GetDatabasePath(xlrec->src_db_id,  xlrec->src_tablespace_id);
+		dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+		pfree(dbpath2);
+		pfree(dbpath1);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -39,8 +42,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfoString(buf, "dir");
 		for (i = 0; i < xlrec->ntablespaces; i++)
-			appendStringInfo(buf, " %u/%u",
-							 xlrec->tablespace_ids[i], xlrec->db_id);
+		{
+			dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+			appendStringInfo(buf,  " %s", dbpath1);
+			pfree(dbpath1);
+		}
 	}
 }
 
-- 
2.14.3

v12-0002-Tests-to-replay-create-database-operation-on-sta.patch (application/octet-stream)

From 6096936d13526f121db4d5f684053cdfb4881ebe Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:34:19 +0530
Subject: [PATCH v12 2/4] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

Second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/perl/PostgresNode.pm             |  21 +++-
 src/test/recovery/t/011_crash_recovery.pl | 162 +++++++++++++++++++++++++++++-
 2 files changed, 177 insertions(+), 6 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index b71e98f254..86e5b2f7be 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2183,18 +2183,29 @@ Returns 1 if successful, 0 if timed out.
 
 sub poll_query_until
 {
-	my ($self, $dbname, $query, $expected) = @_;
+	my ($self, $dbname, $query, $params) = @_;
+	my $expected;
+
+	# Be backwards-compatible
+	if (defined $params and ref $params eq '')
+	{
+		$params = {
+			expected => $params,
+			timeout => 180
+		};
+	}
 
 	local %ENV = $self->_get_env();
 
-	$expected = 't' unless defined($expected);    # default value
+	$params->{expected} = 't' unless defined($params->{expected});
+	$params->{timeout} = 180 unless defined($params->{timeout});
 
 	my $cmd = [
 		$self->installed_command('psql'), '-XAt',
 		'-d',                             $self->connstr($dbname)
 	];
 	my ($stdout, $stderr);
-	my $max_attempts = 180 * 10;
+	my $max_attempts = $params->{timeout} * 10;
 	my $attempts     = 0;
 
 	while ($attempts < $max_attempts)
@@ -2207,7 +2218,7 @@ sub poll_query_until
 		$stderr =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
 		chomp($stderr);
 
-		if ($stdout eq $expected && $stderr eq '')
+		if ($stdout eq $params->{expected} && $stderr eq '')
 		{
 			return 1;
 		}
@@ -2223,7 +2234,7 @@ sub poll_query_until
 	diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
-$expected
+$params->{expected}
 last actual query output:
 $stdout
 with stderr:
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 72fc603e6d..fec330e808 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,9 +9,10 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 
-plan tests => 3;
+plan tests => 5;
 
 my $node = PostgresNode->new('primary');
 $node->init(allows_streaming => 1);
@@ -62,3 +63,162 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    database.
+#
+# 2. Create a database against a user-defined tablespace then drop the
+#    database and the tablespace.
+#
+# 3. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 4. Create a database from another database as template then drop the
+#    template database.
+#
+#
+
+my $node_master = PostgresNode->new('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master1 = TestLib::tempdir;
+$dropme_ts_master1 = TestLib::perl2host($dropme_ts_master1);
+my $dropme_ts_master2 = TestLib::tempdir;
+$dropme_ts_master2 = TestLib::perl2host($dropme_ts_master2);
+my $source_ts_master = TestLib::tempdir;
+$source_ts_master = TestLib::perl2host($source_ts_master);
+my $target_ts_master = TestLib::tempdir;
+$target_ts_master = TestLib::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts1 location '$dropme_ts_master1';
+						   CREATE TABLESPACE dropme_ts2 location '$dropme_ts_master2';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby1 = TestLib::tempdir;
+$dropme_ts_standby1 = TestLib::perl2host($dropme_ts_standby1);
+my $dropme_ts_standby2 = TestLib::tempdir;
+$dropme_ts_standby2 = TestLib::perl2host($dropme_ts_standby2);
+my $source_ts_standby = TestLib::tempdir;
+$source_ts_standby = TestLib::perl2host($source_ts_standby);
+my $target_ts_standby = TestLib::tempdir;
+$target_ts_standby = TestLib::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = [ "--tablespace-mapping=$dropme_ts_master1=$dropme_ts_standby1",
+  "--tablespace-mapping=$dropme_ts_master2=$dropme_ts_standby2",
+  "--tablespace-mapping=$source_ts_master=$source_ts_standby",
+  "--tablespace-mapping=$target_ts_master=$target_ts_standby" ];
+$node_master->backup($backup_name, backup_options => $ts_mapping);
+
+my $node_standby = PostgresNode->new('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = PostgresNode->new('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = TestLib::tempdir;
+$ts_master = TestLib::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = TestLib::tempdir("standby");
+$ts_standby = TestLib::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 backup_options =>
+					   [ "--tablespace-mapping=$ts_master=$ts_standby" ]);
+$node_standby = PostgresNode->new('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgresNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.14.3

v12-0001-Support-node-initialization-from-backup-with-tab.patch (application/octet-stream)

From 5cd04cebf8361cfd7941063cab055a2608ff546f Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v12 1/4] Support node initialization from backup with
 tablespaces

User-defined tablespaces appear as symlinks in the backup.  This
commit tweaks the recursive copy subroutine to allow for symlinks
specific to tablespaces.

Authored by Kyotaro Horiguchi
---
 src/test/perl/PostgresNode.pm  | 29 ++++++++++++++++++++++++++-
 src/test/perl/RecursiveCopy.pm | 45 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8158ea5b2f..b71e98f254 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -634,6 +634,32 @@ sub backup_fs_cold
 	return;
 }
 
+sub _srcsymlink
+{
+	my ($srcpath, $destpath) = @_;
+
+	croak "Cannot operate on symlink \"$srcpath\""
+		if ($srcpath !~ qr{/(pg_tblspc/[0-9]+)$});
+
+	# We have mapped tablespaces. Copy them individually
+	my $tmpdir = TestLib::tempdir;
+	my $dstrealdir = TestLib::perl2host($tmpdir);
+	my $srcrealdir = readlink($srcpath);
+
+	opendir(my $dh, $srcrealdir);
+	while (my $entry = (readdir $dh))
+	{
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
+		RecursiveCopy::copypath($spath, $dpath);
+	}
+	closedir $dh;
+
+	symlink $dstrealdir, $destpath;
+
+	return 1;
+}
 
 # Common sub of backup_fs_hot and backup_fs_cold
 sub _backup_fs
@@ -743,7 +769,8 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		RecursiveCopy::copypath($backup_path, $data_path);
+		RecursiveCopy::copypath($backup_path, $data_path,
+								srcsymlinkfn => \&_srcsymlink);
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index 8a9cc722b5..cbd7a874d7 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -49,6 +49,11 @@ This subroutine will be called for each entry in the source directory with its
 relative path as only parameter; if the subroutine returns true the entry is
 copied, otherwise the file is skipped.
 
+If the B<srcsymlinkfn> parameter is given, it must be a subroutine reference.
+This subroutine will be called when the source directory is a symlink. It
+determines the mechanism that copies files from the source directory to the
+destination directory.
+
 On failure the target directory may be in some incomplete state; no cleanup is
 attempted.
 
@@ -68,6 +73,7 @@ sub copypath
 {
 	my ($base_src_dir, $base_dest_dir, %params) = @_;
 	my $filterfn;
+	my $srcsymlinkfn;
 
 	if (defined $params{filterfn})
 	{
@@ -82,31 +88,55 @@ sub copypath
 		$filterfn = sub { return 1; };
 	}
 
+	if (defined $params{srcsymlinkfn})
+	{
+		croak "if specified, srcsymlinkfn must be a subroutine reference"
+			unless defined(ref $params{srcsymlinkfn})
+			and (ref $params{srcsymlinkfn} eq 'CODE');
+
+		$srcsymlinkfn = $params{srcsymlinkfn};
+	}
+	else
+	{
+		$srcsymlinkfn = undef;
+	}
+
 	# Complain if original path is bogus, because _copypath_recurse won't.
 	croak "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
 
 	# Start recursive copy from current directory
-	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
+	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn, $srcsymlinkfn);
 }
 
 # Recursive private guts of copypath
 sub _copypath_recurse
 {
-	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn) = @_;
+	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn,
+		$srcsymlinkfn) = @_;
 	my $srcpath  = "$base_src_dir/$curr_path";
 	my $destpath = "$base_dest_dir/$curr_path";
 
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# If caller provided us with a callback, call it; otherwise we're out.
+	if (-l $srcpath)
+	{
+		if (defined $srcsymlinkfn)
+		{
+			return &$srcsymlinkfn($srcpath, $destpath);
+		}
+		else
+		{
+			croak "Cannot operate on symlink \"$srcpath\"";
+		}
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
@@ -139,7 +169,8 @@ sub _copypath_recurse
 		{
 			next if ($entry eq '.' or $entry eq '..');
 			_copypath_recurse($base_src_dir, $base_dest_dir,
-				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
+				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn,
+				$srcsymlinkfn)
 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
 		}
 
-- 
2.14.3

v12-0003-Fix-replay-of-create-database-records-on-standby.patch (application/octet-stream)

From 42eda96adda2815ac91f4323964d8cbf7863b23c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v12 3/4] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records by
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 ++
 src/backend/access/transam/xlogutils.c | 145 +++++++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  55 +++++++++++++
 src/backend/commands/tablespace.c      |   5 ++
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 215 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b39a2fdaa..d48716b506 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8172,6 +8172,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b1702bc6be..13d0f1c101 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab48df..0f483edb71 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2143,7 +2143,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2161,6 +2163,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2218,6 +2269,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index a54239a8b3..2a4d14e29f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1531,6 +1531,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index a5cb3d322c..2ee3bd378d 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.14.3
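
For readers skimming the 0003 patch above, the bookkeeping it describes can be sketched outside the backend roughly as follows. This is only an illustration under stated assumptions: it uses a fixed-size array instead of the dynahash table in the patch, and the names report_missing_dir, forget_missing_dir and count_unresolved_dirs are hypothetical stand-ins for XLogReportMissingDir, XLogForgetMissingDir and XLogCheckMissingDirs. A database OID of 0 plays the role of InvalidOid, meaning the tablespace directory itself is missing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MISSING 64			/* arbitrary capacity for this sketch */

typedef struct MissingDir
{
	unsigned int spc;			/* tablespace OID, must be nonzero */
	unsigned int db;			/* database OID, 0 = tablespace directory */
	bool		used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Remember a directory that was absent while replaying a create record. */
bool
report_missing_dir(unsigned int spc, unsigned int db)
{
	int			free_slot = -1;

	assert(spc != 0);
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return true;		/* already tracked, nothing to do */
		if (!missing[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;			/* table full (the real hash table grows) */
	missing[free_slot] = (MissingDir) {spc, db, true};
	return true;
}

/* A later drop database / drop tablespace record resolves the reference. */
void
forget_missing_dir(unsigned int spc, unsigned int db)
{
	assert(spc != 0);
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			missing[i].used = false;
	}
}

/* The patch PANICs at consistency if this would be nonzero. */
int
count_unresolved_dirs(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used)
			n++;
	}
	return n;
}
```

In this sketch, replaying CREATE DATABASE against a removed tablespace would call report_missing_dir(ts, 0), the later DROP TABLESPACE record would call forget_missing_dir(ts, 0), and recovery may proceed only if count_unresolved_dirs() returns 0 when consistency is reached.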

#47

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Paul Guo (#46)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Thu, Aug 5, 2021 at 6:20 AM Paul Guo <guopa@vmware.com> wrote:

> Rebased.

The commit message for 0001 is not clear enough for me to understand
what problem it's supposed to be fixing. The code comments aren't
really either. They make it sound like there's some problem with
copying symlinks but mostly they just talk about callbacks, which
doesn't really help me understand what problem we'd have if we just
didn't commit this (or reverted it later).

I am not really convinced by Álvaro's claim that 0004 is a "fix"; I
think I'd call it an improvement. But either way I agree that could
just be committed.

I haven't analyzed 0002 and 0003 yet.

--
Robert Haas
EDB: http://www.enterprisedb.com

#48

Paul Guo

paulguo@gmail.com

over 4 years ago

In reply to: Robert Haas (#47)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Aug 11, 2021 at 4:56 AM Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Aug 5, 2021 at 6:20 AM Paul Guo <guopa@vmware.com> wrote:
>
>> Rebased.
>
> The commit message for 0001 is not clear enough for me to understand
> what problem it's supposed to be fixing. The code comments aren't
> really either. They make it sound like there's some problem with
> copying symlinks but mostly they just talk about callbacks, which
> doesn't really help me understand what problem we'd have if we just
> didn't commit this (or reverted it later).

Thanks for reviewing. Let me explain a bit. The patch series includes
four patches.

0001 and 0002 are test changes for the fix (0003).
- 0001 is the test framework change that's needed by 0002.
- 0002 is the test for the code fix (0003).
0003 is the code change and the commit message explains the issue in detail.
0004 as said is a small enhancement which is a bit independent of the
previous patches.

Basically, the issue is that without the fix, crash recovery might fail
in tablespace-related cases.
Here is the log after I ran the tests in 0001/0002 without the 0003 fix.

2021-08-04 10:00:42.231 CST [875] FATAL: could not create directory
"pg_tblspc/16385/PG_15_202107261/16390": No such file or directory
2021-08-04 10:00:42.231 CST [875] CONTEXT: WAL redo at 0/3001320 for
Database/CREATE: copy dir base/1 to
pg_tblspc/16385/PG_15_202107261/16390
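
For illustration only (this is not part of the patch series): the "No such file or directory" in the log above is what mkdir() reports when the parent directory has already been removed. A minimal C sketch of that behaviour, where try_create_dbdir is a hypothetical helper and the path arguments are placeholders rather than anything PostgreSQL uses:

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Try to create "parent/name".  Returns 0 on success, or the errno set by
 * mkdir() on failure.  When "parent" no longer exists, mkdir() fails with
 * ENOENT, which is the error shown in the standby log.
 */
int
try_create_dbdir(const char *parent, const char *name)
{
	char		path[1024];

	assert(parent != NULL && name != NULL);
	snprintf(path, sizeof(path), "%s/%s", parent, name);
	if (mkdir(path, 0700) != 0)
		return errno;
	return 0;
}
```

With the tablespace directory in place the call returns 0; once that directory has been removed by an earlier replay of DROP TABLESPACE, the same call returns ENOENT.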

> I am not really convinced by Álvaro's claim that 0004 is a "fix"; I
> think I'd call it an improvement. But either way I agree that could
> just be committed.
>
> I haven't analyzed 0002 and 0003 yet.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

--
Paul Guo (Vmware)

#49

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Paul Guo (#48)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Aug 11, 2021 at 3:59 AM Paul Guo <paulguo@gmail.com> wrote:

> Thanks for reviewing. Let me explain a bit. The patch series includes
> four patches.
>
> 0001 and 0002 are test changes for the fix (0003).
> - 0001 is the test framework change that's needed by 0002.
> - 0002 is the test for the code fix (0003).
> 0003 is the code change and the commit message explains the issue in detail.
> 0004 as said is a small enhancement which is a bit independent of the
> previous patches.
>
> Basically, the issue is that without the fix, crash recovery might fail
> in tablespace-related cases.
> Here is the log after I ran the tests in 0001/0002 without the 0003 fix.

I do understand all of this, but I (or whoever might commit this)
also need to be able to understand specifically what each patch is
doing.

--
Robert Haas
EDB: http://www.enterprisedb.com

#50

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Robert Haas (#47)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Robert Haas <robertmhaas@gmail.com> writes:

> The commit message for 0001 is not clear enough for me to understand
> what problem it's supposed to be fixing. The code comments aren't
> really either. They make it sound like there's some problem with
> copying symlinks but mostly they just talk about callbacks, which
> doesn't really help me understand what problem we'd have if we just
> didn't commit this (or reverted it later).
>
> I am not really convinced by Álvaro's claim that 0004 is a "fix"; I
> think I'd call it an improvement. But either way I agree that could
> just be committed.
>
> I haven't analyzed 0002 and 0003 yet.

I took a quick look through this:

* I don't like 0001 either, though it seems like the issue is mostly
documentation. sub _srcsymlink should have a comment explaining
what it's doing and why. The documentation of copypath's new parameter
seems like gobbledegook too --- I suppose it should read more like
"By default, copypath fails if a source item is a symlink. But if
B<srcsymlinkfn> is provided, that subroutine is called to process any
symlink."

* I'm allergic to 0002's completely undocumented changes to
poll_query_until, especially since I don't see anything in the
patch that actually uses them. Can't we just drop these diffs
in PostgresNode.pm? BTW, the last error message in the patch,
talking about a 5-second timeout, seems wrong. With or without
these changes, poll_query_until's default timeout is 180 sec.
The actual test case might be okay other than that nit and a
comment typo or two.

* 0003 might actually be okay. I've not read it line-by-line,
but it seems like it's implementing a sane solution and it's
adequately commented.

* I'm inclined to reject 0004 out of hand, because I don't
agree with what it's doing. The purpose of the rmgrdesc
functions is to show you what is in the WAL records, and
everywhere else we interpret that as "show the verbatim,
numeric field contents". heapdesc.c, for example, doesn't
attempt to look up the name of the table being operated on.
0004 isn't adhering to that style, and aside from being
inconsistent I'm afraid that it's adding failure modes
we don't want.

regards, tom lane
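
As an aside on the 0001 point: the behaviour suggested for the documentation ("by default, copypath fails if a source item is a symlink, but if a callback is provided, that subroutine is called to process any symlink") can be sketched in C as below. This is a hedged illustration and not a translation of RecursiveCopy.pm; handle_source_entry, symlink_fn, note_symlink and symlinks_seen are hypothetical names, and the sample callback only counts calls instead of copying anything.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Callback type: returns 1 if the symlink was processed, 0 otherwise. */
typedef int (*symlink_fn) (const char *srcpath, const char *destpath);

int			symlinks_seen = 0;

/* Sample callback for the sketch: just records that it was invoked. */
int
note_symlink(const char *srcpath, const char *destpath)
{
	(void) srcpath;
	(void) destpath;
	symlinks_seen++;
	return 1;
}

/*
 * Decide what to do with one source entry.
 * Returns -1 if the entry cannot be examined, 0 if it is a symlink and no
 * callback was given (the default "refuse" behaviour), the callback's
 * result if one was given, and 1 for anything that is not a symlink (the
 * caller would copy it normally).
 */
int
handle_source_entry(const char *srcpath, const char *destpath,
					symlink_fn on_symlink)
{
	struct stat st;

	assert(srcpath != NULL && destpath != NULL);
	if (lstat(srcpath, &st) != 0)
		return -1;
	if (!S_ISLNK(st.st_mode))
		return 1;
	if (on_symlink == NULL)
		return 0;
	return on_symlink(srcpath, destpath);
}
```

Passing NULL keeps the old fail-on-symlink behaviour, so existing callers are unaffected; only a caller that knows about pg_tblspc symlinks would supply a callback.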

#51

Daniel Gustafsson

daniel@yesql.se

about 4 years ago

In reply to: Tom Lane (#50)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 24 Sep 2021, at 20:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>
>> The commit message for 0001 is not clear enough for me to understand
>> what problem it's supposed to be fixing. The code comments aren't
>> really either. They make it sound like there's some problem with
>> copying symlinks but mostly they just talk about callbacks, which
>> doesn't really help me understand what problem we'd have if we just
>> didn't commit this (or reverted it later).
>>
>> I am not really convinced by Álvaro's claim that 0004 is a "fix"; I
>> think I'd call it an improvement. But either way I agree that could
>> just be committed.
>>
>> I haven't analyzed 0002 and 0003 yet.
>
> I took a quick look through this:
>
> * I don't like 0001 either, though it seems like the issue is mostly
> documentation. sub _srcsymlink should have a comment explaining
> what it's doing and why. The documentation of copypath's new parameter
> seems like gobbledegook too --- I suppose it should read more like
> "By default, copypath fails if a source item is a symlink. But if
> B<srcsymlinkfn> is provided, that subroutine is called to process any
> symlink."
>
> * I'm allergic to 0002's completely undocumented changes to
> poll_query_until, especially since I don't see anything in the
> patch that actually uses them. Can't we just drop these diffs
> in PostgresNode.pm? BTW, the last error message in the patch,
> talking about a 5-second timeout, seems wrong. With or without
> these changes, poll_query_until's default timeout is 180 sec.
> The actual test case might be okay other than that nit and a
> comment typo or two.
>
> * 0003 might actually be okay. I've not read it line-by-line,
> but it seems like it's implementing a sane solution and it's
> adequately commented.
>
> * I'm inclined to reject 0004 out of hand, because I don't
> agree with what it's doing. The purpose of the rmgrdesc
> functions is to show you what is in the WAL records, and
> everywhere else we interpret that as "show the verbatim,
> numeric field contents". heapdesc.c, for example, doesn't
> attempt to look up the name of the table being operated on.
> 0004 isn't adhering to that style, and aside from being
> inconsistent I'm afraid that it's adding failure modes
> we don't want.

This patch again fails to apply (seemingly from the Perl namespace work on the
testcode), and needs a few updates as per the above review.

--
Daniel Gustafsson https://vmware.com/

#52

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Daniel Gustafsson (#51)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 4 Nov 2021 13:34:33 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in

> This patch again fails to apply (seemingly from the Perl namespace work on the
> testcode), and needs a few updates as per the above review.

Rebased the latest patch, removing some of the changes.

0001: (I don't remember much about this, though) I don't see how to make
it work on Windows. Anyway, the next step would be to write comments.

0002: I didn't look at it in detail and didn't check whether it detects
the issue, but it does succeed with the fix. The change to
poll_query_until is removed since it doesn't seem to be actually used.

0003: The fix. I didn't touch this.

0004: Removed entirely. I agree with Tom. (And I faintly remember that I
said something like that.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v13-0001-Support-node-initialization-from-backup-with-tab.patch (text/x-patch; charset=us-ascii)

From aa6b0b94e42550f23c4cecfa23ea1face7d71bc6 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Mon, 8 Nov 2021 17:32:30 +0900
Subject: [PATCH v13 1/3] Support node initialization from backup with
 tablespaces

User defined tablespaces appear as symlinks in the backup.  This
commit tweaks the recursive copy subroutine to allow for symlinks specific
to tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      | 29 +++++++++++-
 .../perl/PostgreSQL/Test/RecursiveCopy.pm     | 45 ++++++++++++++++---
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 9467a199c8..19a667ebe4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -634,6 +634,32 @@ sub backup_fs_cold
 	return;
 }
 
+sub _srcsymlink
+{
+	my ($srcpath, $destpath) = @_;
+
+	croak "Cannot operate on symlink \"$srcpath\""
+		if ($srcpath !~ qr{/(pg_tblspc/[0-9]+)$});
+
+	# We have mapped tablespaces. Copy them individually
+	my $tmpdir = PostgreSQL::Test::Utils::tempdir();
+	my $dstrealdir = PostgreSQL::Test::Utils::perl2host($tmpdir);
+	my $srcrealdir = readlink($srcpath);
+
+	opendir(my $dh, $srcrealdir);
+	while (my $entry = (readdir $dh))
+	{
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
+		PostgreSQL::Test::RecursiveCopy::copypath($spath, $dpath);
+	}
+	closedir $dh;
+
+	symlink $dstrealdir, $destpath;
+
+	return 1;
+}
 
 # Common sub of backup_fs_hot and backup_fs_cold
 sub _backup_fs
@@ -743,7 +769,8 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path,
+								srcsymlinkfn => \&_srcsymlink);
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/RecursiveCopy.pm b/src/test/perl/PostgreSQL/Test/RecursiveCopy.pm
index dd320a605e..2a636cef84 100644
--- a/src/test/perl/PostgreSQL/Test/RecursiveCopy.pm
+++ b/src/test/perl/PostgreSQL/Test/RecursiveCopy.pm
@@ -49,6 +49,11 @@ This subroutine will be called for each entry in the source directory with its
 relative path as only parameter; if the subroutine returns true the entry is
 copied, otherwise the file is skipped.
 
+If the B<srcsymlinkfn> parameter is given, it must be a subroutine reference.
+This subroutine will be called when the source directory is a symlink. It
+determines the mechanism that copies files from the source directory to the
+destination directory.
+
 On failure the target directory may be in some incomplete state; no cleanup is
 attempted.
 
@@ -68,6 +73,7 @@ sub copypath
 {
 	my ($base_src_dir, $base_dest_dir, %params) = @_;
 	my $filterfn;
+	my $srcsymlinkfn;
 
 	if (defined $params{filterfn})
 	{
@@ -82,31 +88,55 @@ sub copypath
 		$filterfn = sub { return 1; };
 	}
 
+	if (defined $params{srcsymlinkfn})
+	{
+		croak "if specified, srcsymlinkfn must be a subroutine reference"
+			unless defined(ref $params{srcsymlinkfn})
+			and (ref $params{srcsymlinkfn} eq 'CODE');
+
+		$srcsymlinkfn = $params{srcsymlinkfn};
+	}
+	else
+	{
+		$srcsymlinkfn = undef;
+	}
+
 	# Complain if original path is bogus, because _copypath_recurse won't.
 	croak "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
 
 	# Start recursive copy from current directory
-	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
+	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn, $srcsymlinkfn);
 }
 
 # Recursive private guts of copypath
 sub _copypath_recurse
 {
-	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn) = @_;
+	my ($base_src_dir, $base_dest_dir, $curr_path, $filterfn,
+		$srcsymlinkfn) = @_;
 	my $srcpath  = "$base_src_dir/$curr_path";
 	my $destpath = "$base_dest_dir/$curr_path";
 
 	# invoke the filter and skip all further operation if it returns false
 	return 1 unless &$filterfn($curr_path);
 
-	# Check for symlink -- needed only on source dir
-	# (note: this will fall through quietly if file is already gone)
-	croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
 	# Abort if destination path already exists.  Should we allow directories
 	# to exist already?
 	croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+	# Check for symlink -- needed only on source dir
+	# If caller provided us with a callback, call it; otherwise we're out.
+	if (-l $srcpath)
+	{
+		if (defined $srcsymlinkfn)
+		{
+			return &$srcsymlinkfn($srcpath, $destpath);
+		}
+		else
+		{
+			croak "Cannot operate on symlink \"$srcpath\"";
+		}
+	}
+
 	# If this source path is a file, simply copy it to destination with the
 	# same name and we're done.
 	if (-f $srcpath)
@@ -139,7 +169,8 @@ sub _copypath_recurse
 		{
 			next if ($entry eq '.' or $entry eq '..');
 			_copypath_recurse($base_src_dir, $base_dest_dir,
-				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
+				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn,
+				$srcsymlinkfn)
 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
 		}
 
-- 
2.27.0

v13-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From 030f30d330dba3a6c3ff3f9561348375d30a1486 Mon Sep 17 00:00:00 2001
From: Asim R P <apraveen@pivotal.io>
Date: Fri, 20 Sep 2019 17:31:25 +0530
Subject: [PATCH v13 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

Second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 162 +++++++++++++++++++++-
 1 file changed, 161 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index d7806e6671..a4e1fcb5dc 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,9 +9,10 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use File::Path qw(rmtree);
 use Config;
 
-plan tests => 3;
+plan tests => 5;
 
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
@@ -62,3 +63,162 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+# TEST 4
+#
+# Ensure that a missing tablespace directory during crash recovery on
+# a standby is handled correctly.  The standby should finish crash
+# recovery successfully because a matching drop database record is
+# found in the WAL.  The following scenarios are covered:
+#
+# 1. Create a database against a user-defined tablespace then drop the
+#    database.
+#
+# 2. Create a database against a user-defined tablespace then drop the
+#    database and the tablespace.
+#
+# 3. Move a database from source tablespace to target tablespace then
+#    drop the source tablespace.
+#
+# 4. Create a database from another database as template then drop the
+#    template database.
+#
+#
+
+my $node_master = PostgreSQL::Test::Cluster->new('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $dropme_ts_master1 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_master1 = PostgreSQL::Test::Utils::perl2host($dropme_ts_master1);
+my $dropme_ts_master2 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_master2 = PostgreSQL::Test::Utils::perl2host($dropme_ts_master2);
+my $source_ts_master = PostgreSQL::Test::Utils::tempdir();
+$source_ts_master = PostgreSQL::Test::Utils::perl2host($source_ts_master);
+my $target_ts_master = PostgreSQL::Test::Utils::tempdir();
+$target_ts_master = PostgreSQL::Test::Utils::perl2host($target_ts_master);
+
+$node_master->safe_psql('postgres',
+						qq[CREATE TABLESPACE dropme_ts1 location '$dropme_ts_master1';
+						   CREATE TABLESPACE dropme_ts2 location '$dropme_ts_master2';
+						   CREATE TABLESPACE source_ts location '$source_ts_master';
+						   CREATE TABLESPACE target_ts location '$target_ts_master';
+						   CREATE DATABASE template_db IS_TEMPLATE = true;]);
+
+my $dropme_ts_standby1 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_standby1 = PostgreSQL::Test::Utils::perl2host($dropme_ts_standby1);
+my $dropme_ts_standby2 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_standby2 = PostgreSQL::Test::Utils::perl2host($dropme_ts_standby2);
+my $source_ts_standby = PostgreSQL::Test::Utils::tempdir();
+$source_ts_standby = PostgreSQL::Test::Utils::perl2host($source_ts_standby);
+my $target_ts_standby = PostgreSQL::Test::Utils::tempdir();
+$target_ts_standby = PostgreSQL::Test::Utils::perl2host($target_ts_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+my $ts_mapping = [ "--tablespace-mapping=$dropme_ts_master1=$dropme_ts_standby1",
+  "--tablespace-mapping=$dropme_ts_master2=$dropme_ts_standby2",
+  "--tablespace-mapping=$source_ts_master=$source_ts_standby",
+  "--tablespace-mapping=$target_ts_master=$target_ts_standby" ];
+$node_master->backup($backup_name, backup_options => $ts_mapping);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_master->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+							   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_master = PostgreSQL::Test::Cluster->new('master3');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $ts_master = PostgreSQL::Test::Utils::tempdir();
+$ts_master = PostgreSQL::Test::Utils::perl2host($ts_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$ts_master'");
+$node_master->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+my $ts_standby = PostgreSQL::Test::Utils::tempdir("standby");
+$ts_standby = PostgreSQL::Test::Utils::perl2host($ts_standby);
+
+# Take backup
+$backup_name = 'my_backup';
+$node_master->backup($backup_name,
+					 backup_options =>
+					   [ "--tablespace-mapping=$ts_master=$ts_standby" ]);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+#
+# The tablespace mapping is lost when the standby node is initialized
+# from basebackup because RecursiveCopy::copypath creates a new temp
+# directory for each tablespace symlink found in backup.  We must
+# obtain the correct tablespace directory by querying standby.
+$ts_standby = $node_standby->safe_psql(
+	'postgres',
+	"select pg_tablespace_location(oid) from pg_tablespace where spcname = 'ts1'");
+rmtree($ts_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_master->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_master->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids an error when the PostgreSQL::Test::Cluster module
+	# tries to stop the standby node as part of the tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail before poll_query_until timed out');
+}
-- 
2.27.0

v13-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From 42d379e23f99b91565c24e23073a6da14bf98f19 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v13 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism similar to the invalid page hash table, to track
missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 +
 src/backend/access/transam/xlogutils.c | 145 +++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  55 ++++++++++
 src/backend/commands/tablespace.c      |   5 +
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 215 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f..c6d5fc782f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8313,6 +8313,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e0531ed..99abf8b2f4 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %u) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %u database %u) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %u)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %u)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab48df..0f483edb71 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2143,7 +2143,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2161,6 +2163,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2218,6 +2269,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4b96eec9df..0d5dfe007f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1527,6 +1527,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index eebc91f3a5..3341efc052 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0
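
The bookkeeping that 0003 adds can be sketched outside the backend as follows. This is only an illustration under stated assumptions: it uses a small fixed array instead of the backend's dynahash table, OIDs are plain unsigned integers, and every demo_* name is hypothetical. The three operations mirror the roles of XLogReportMissingDir, XLogForgetMissingDir and XLogCheckMissingDirs in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define DEMO_MAX_DIRS 16
#define DEMO_INVALID_OID 0u

/* One remembered directory, keyed by (tablespace OID, database OID). */
typedef struct demo_missing_dir
{
	bool		used;
	unsigned int spc;
	unsigned int db;		/* DEMO_INVALID_OID: the tablespace dir itself */
	char		path[256];
} demo_missing_dir;

static demo_missing_dir demo_tab[DEMO_MAX_DIRS];

static demo_missing_dir *
demo_find(unsigned int spc, unsigned int db)
{
	for (int i = 0; i < DEMO_MAX_DIRS; i++)
	{
		if (demo_tab[i].used && demo_tab[i].spc == spc && demo_tab[i].db == db)
			return &demo_tab[i];
	}
	return NULL;
}

/*
 * Remember a directory that create-database replay found missing.
 * Returns false only if the demo table is full.
 */
bool
demo_report_missing_dir(unsigned int spc, unsigned int db, const char *path)
{
	assert(spc != DEMO_INVALID_OID);

	if (demo_find(spc, db) != NULL)
		return true;			/* already recorded, keep the first path */

	for (int i = 0; i < DEMO_MAX_DIRS; i++)
	{
		if (!demo_tab[i].used)
		{
			demo_tab[i].used = true;
			demo_tab[i].spc = spc;
			demo_tab[i].db = db;
			snprintf(demo_tab[i].path, sizeof(demo_tab[i].path), "%s", path);
			return true;
		}
	}
	return false;
}

/* A later drop record resolves the entry with exactly the same key. */
void
demo_forget_missing_dir(unsigned int spc, unsigned int db)
{
	demo_missing_dir *entry = demo_find(spc, db);

	if (entry != NULL)
		entry->used = false;
}

/* Entries still unresolved; the patch PANICs when this would be nonzero. */
int
demo_count_unresolved(void)
{
	int			n = 0;

	for (int i = 0; i < DEMO_MAX_DIRS; i++)
	{
		if (demo_tab[i].used)
			n++;
	}
	return n;
}
```

In the patch, the check runs from CheckRecoveryConsistency, so an entry that is never matched by a drop record stops recovery instead of being silently ignored.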

#53

Michael Paquier

michael@paquier.xyz

about 4 years ago

In reply to: Kyotaro Horiguchi (#52)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Nov 08, 2021 at 05:55:16PM +0900, Kyotaro Horiguchi wrote:

I have quickly looked at the patch set.

0001: (I don't remember about this, though) I don't see how to make it
work on Windows. Anyway the next step would be to write comments.

Look at Utils.pm where we have dir_symlink, then. symlink() does not
work on WIN32, so we have a wrapper that uses junction points. FWIW,
I don't like much the behavior you are enforcing in init_from_backup
when coldly copying a source path, but I have not looked enough at the
patch set to have a strong opinion about this part, either.

0002: I didn't look at it in detail and didn't check whether it detects
the issue, but it does succeed with the fix. The change to
poll_query_until is removed since it doesn't seem to be actually used.

+# Create tablespace
+my $dropme_ts_master1 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_master1 =
PostgreSQL::Test::Utils::perl2host($dropme_ts_master1);
+my $dropme_ts_master2 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_master2 =
PostgreSQL::Test::Utils::perl2host($dropme_ts_master2);
+my $source_ts_master = PostgreSQL::Test::Utils::tempdir();
+$source_ts_master =
PostgreSQL::Test::Utils::perl2host($source_ts_master);
+my $target_ts_master = PostgreSQL::Test::Utils::tempdir();
+$target_ts_master =
PostgreSQL::Test::Utils::perl2host($target_ts_master);

Rather than creating N temporary directories, it would be simpler to
create only one, and have subdirs in it for the rest? It seems to me
that it would make debugging much easier. The uses of perl2host()
seem sufficient.
--
Michael

#54

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Michael Paquier (#53)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 9 Nov 2021 12:51:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Mon, Nov 08, 2021 at 05:55:16PM +0900, Kyotaro Horiguchi wrote:

I have quickly looked at the patch set.

0001: (I don't remember about this, though) I don't see how to make it
work on Windows. Anyway the next step would be to write comments.

Look at Utils.pm where we have dir_symlink, then. symlink() does not
work on WIN32, so we have a wrapper that uses junction points. FWIW,
I don't like much the behavior you are enforcing in init_from_backup
when coldly copying a source path, but I have not looked enough at the
patch set to have a strong opinion about this part, either.

Thanks for the info. If we can handle symlink on Windows, we don't
need to have a cold copy.

0002: I didn't look at it in detail and didn't check whether it detects
the issue, but it does succeed with the fix. The change to
poll_query_until is removed since it doesn't seem to be actually used.

+# Create tablespace
+my $dropme_ts_master1 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_master1 =
PostgreSQL::Test::Utils::perl2host($dropme_ts_master1);
+my $dropme_ts_master2 = PostgreSQL::Test::Utils::tempdir();
+$dropme_ts_master2 =
PostgreSQL::Test::Utils::perl2host($dropme_ts_master2);
+my $source_ts_master = PostgreSQL::Test::Utils::tempdir();
+$source_ts_master =
PostgreSQL::Test::Utils::perl2host($source_ts_master);
+my $target_ts_master = PostgreSQL::Test::Utils::tempdir();
+$target_ts_master =
PostgreSQL::Test::Utils::perl2host($target_ts_master);

Rather than creating N temporary directories, it would be simpler to
create only one, and have subdirs in it for the rest? It seems to me
that it would make debugging much easier. The uses of perl2host()
seem sufficient.

Thanks for the suggestion. My eyes kept jumping around while reading
that part, so I gave up looking at it in more detail :p I agree with that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#55

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Kyotaro Horiguchi (#54)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 09 Nov 2021 17:05:49 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 9 Nov 2021 12:51:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in

Look at Utils.pm where we have dir_symlink, then. symlink() does not
work on WIN32, so we have a wrapper that uses junction points. FWIW,
I don't like much the behavior you are enforcing in init_from_backup
when coldly copying a source path, but I have not looked enough at the
patch set to have a strong opinion about this part, either.

Thanks for the info. If we can handle symlink on Windows, we don't
need to have a cold copy.

I bumped into the good old 100-byte limit of the (v7?) tar format on
which pg_basebackup depends. It is unlikely in the real world, but
I think it is quite common in development environments. The tablespace
directory path in my dev environment was 110 characters long. That is
only 10 bytes over, but it's quite annoying to chip that many bytes off
the path.
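
A minimal sketch of the length check, assuming the limit in question is the 100-byte link-target (linkname) field of the classic tar header. The function name and the macro are hypothetical, and the check is deliberately conservative: it requires room for a terminating NUL, so only targets of at most 99 bytes pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Size of the linkname field in a classic tar header. */
#define DEMO_TAR_LINKNAME_LEN 100

/*
 * Return true if a symlink target (for example a tablespace location)
 * fits in the linkname field together with a terminating NUL.  Whether a
 * particular tar writer also accepts exactly 100 bytes is not assumed.
 */
bool
demo_fits_tar_linkname(const char *target)
{
	assert(target != NULL);
	return strlen(target) < DEMO_TAR_LINKNAME_LEN;
}
```

A 110-character tablespace path like the one mentioned above fails this check, while anything up to 99 bytes passes.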

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#56

Alvaro Herrera

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Kyotaro Horiguchi (#55)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2021-Nov-10, Kyotaro Horiguchi wrote:

I bumped into the good old 100-byte limit of the (v7?) tar format on
which pg_basebackup depends. It is unlikely in the real world, but
I think it is quite common in development environments. The tablespace
directory path in my dev environment was 110 characters long. That is
only 10 bytes over, but it's quite annoying to chip that many bytes off
the path.

Can you use PostgreSQL::Test::Utils::tempdir_short() for those
tablespaces?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#57

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Alvaro Herrera (#56)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Wed, 10 Nov 2021 09:14:30 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

Can you use PostgreSQL::Test::Utils::tempdir_short() for those
tablespaces?

Thanks for the suggestion!

It works for a live cluster, but it doesn't work for backups, since I
can find no way to relate a tablespace directory to a backup directory
without using a symlink. One way would be to take a backup with a tentative
tablespace directory in the short-named temporary directory and then move
it into the backup directory. I'm going that way for now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#58

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Kyotaro Horiguchi (#57)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 11 Nov 2021 11:13:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 10 Nov 2021 09:14:30 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

Can you use PostgreSQL::Test::Utils::tempdir_short() for those
tablespaces?

Thanks for the suggestion!

It works for a live cluster, but it doesn't work for backups, since I
can find no way to relate a tablespace directory to a backup directory
without using a symlink. One way would be to take a backup with a tentative
tablespace directory in the short-named temporary directory and then move
it into the backup directory. I'm going that way for now.

Here is a patch set that does that.

0001 adds several routines to handle tablespace directories, and adds
tablespace support to backup/_backup_fs.

We don't know the OID corresponding to a tablespace directory until
the OID is actually assigned to the tablespace, so we cannot name a
tablespace directory after its OID. On the other hand, once the
tablespace is defined, the cold data files don't tell us the real name
of the tablespace directory for an OID or a tablespace name, unless we
have readlink.

The function dir_readlink added to Utils.pm does that. Honestly, I don't
like the way the function works. It uses "cmd /c "dir /A:L $dir"" to
collect information about junctions. I'm not sure that the type label
"<JUNCTION>" is the same across locales, but at least it is shown as
"<JUNCTION>" in a Japanese (CP-932) environment. I didn't actually
test it on Windows and msys environments ...yet.
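
For comparison, on POSIX systems the same lookup is a single readlink() call. The following C sketch only illustrates that idea and is not part of the patch. The function name tablespace_link_basename and the 4096-byte buffer are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolve a tablespace symlink such as "<datadir>/pg_tblspc/16384" and copy
 * the last component of its target (e.g. "ts1") into "out".
 * Returns 0 on success, -1 if the link cannot be read or "out" is too small.
 */
int
tablespace_link_basename(const char *linkpath, char *out, size_t outlen)
{
	char		target[4096];	/* assumed to be large enough for a test path */
	ssize_t		n = readlink(linkpath, target, sizeof(target) - 1);
	const char *base;

	if (n < 0)
		return -1;
	target[n] = '\0';			/* readlink() does not NUL-terminate */

	/* Strip trailing slashes so that "/a/ts1/" still yields "ts1". */
	while (n > 1 && target[n - 1] == '/')
		target[--n] = '\0';

	base = strrchr(target, '/');
	base = (base != NULL) ? base + 1 : target;

	if (strlen(base) >= outlen)
		return -1;
	strcpy(out, base);
	return 0;
}
```

This mirrors what the tablespaces() routine in 0001 does with dir_readlink followed by basename. Windows junctions need a different mechanism, which is why the Perl helper has a separate branch.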

Given that function, we can give tablespace directories meaningful
names.

The directory that stores tablespace directories could be a temporary
directory, but then a symlink would be needed to find those
directories from a backup. I chose to place tablespace directories
directly under the backup directory.

The first attached file is a revised (or rather rewritten) version of
tablespace support for TAP tests.

The second is the version adapted to the revised framework. (I
confirmed that the test actually detects the error.)

The third is not changed at all.
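
For readers skimming the thread, the bookkeeping done by the third patch can be sketched roughly as follows. This is a simplified, standalone illustration and not the patch itself: a fixed-size array stands in for the dynahash table the patch uses, and the names report_missing_dir, forget_missing_dir, count_unresolved_missing_dirs, MAX_MISSING and INVALID_OID are hypothetical.

```c
#include <stdbool.h>

#define MAX_MISSING 64			/* arbitrary capacity for this sketch */
#define INVALID_OID 0u

typedef struct MissingDir
{
	unsigned int spc;			/* tablespace OID, always valid */
	unsigned int db;			/* database OID, or INVALID_OID when the
								 * tablespace directory itself is missing */
	bool		used;
} MissingDir;

static MissingDir missing_dirs[MAX_MISSING];

/* Remember a directory found missing while replaying CREATE DATABASE. */
bool
report_missing_dir(unsigned int spc, unsigned int db)
{
	int			free_slot = -1;

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing_dirs[i].used)
		{
			if (missing_dirs[i].spc == spc && missing_dirs[i].db == db)
				return true;	/* already recorded */
		}
		else if (free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;			/* table full */

	missing_dirs[free_slot].spc = spc;
	missing_dirs[free_slot].db = db;
	missing_dirs[free_slot].used = true;
	return true;
}

/* A later DROP DATABASE or DROP TABLESPACE record explains the gap. */
void
forget_missing_dir(unsigned int spc, unsigned int db)
{
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing_dirs[i].used &&
			missing_dirs[i].spc == spc && missing_dirs[i].db == db)
			missing_dirs[i].used = false;
	}
}

/* Entries still unresolved; recovery should not proceed if this is nonzero. */
int
count_unresolved_missing_dirs(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		if (missing_dirs[i].used)
			n++;
	return n;
}
```

In the patch, the check corresponding to count_unresolved_missing_dirs runs when the consistent recovery state is reached and raises PANIC if any entry remains.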

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v14-0001-Add-tablespace-support-to-TAP-framework.patch (text/x-patch; charset=us-ascii)

From 5381df72dff0f326ffd20ae212bc43aa54ee8a86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v14 1/3] Add tablespace support to TAP framework

TAP framework doesn't support nodes that have tablespaces.  In particular,
backup and initialization from backups fail if the source node has
tablespaces.  This commit provides a simple way to create tablespace
directories and allows backup routines to handle tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 262 ++++++++++++++++++++++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  43 ++++
 2 files changed, 303 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 9467a199c8..e195f11a23 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -287,6 +287,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([, nocreate])
+
+Directory to store tablespace directories.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -302,6 +360,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create tablespace location directory for backup_name if it does not exist yet.
+Create the parent tablespace storage that holds all location
+directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a hash from tablespace OID to tablespace directory name of
+tablespace directories that the specified backup has.  For example, an
+oid 16384 pointing to ../tsps/backup1/ts1 is stored as $hash{16384} =
+"ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -334,6 +463,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	print $fh "Install Path: ", $self->{_install_path} . "\n"
@@ -564,6 +694,43 @@ sub adjust_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Create a tablespace directory with the name then returns the path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Return the path of the existing tablespace with the name.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -583,9 +750,24 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
 
 	local %ENV = $self->_get_env();
 
+	# Build tablespace mappings.  We once let pg_basebackup copy
+	# tablespaces into temporary tablespace storage with a short name
+	# so that we can work on pathnames that fit our tar format which
+	# pg_basebackup depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = PostgreSQL::Test::Utils::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	PostgreSQL::Test::Utils::system_or_bail(
 		'pg_basebackup', '-D',
@@ -593,7 +775,32 @@ sub backup
 		$self->host,     '-p',
 		$self->port,     '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup directory.
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not necessarily required
+		# in backups but keep them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -655,11 +862,32 @@ sub _backup_fs
 	PostgreSQL::Test::RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -743,7 +971,37 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_path,
+			$data_path,
+			# Skipping tablespace symlinks
+			filterfn => sub {
+				my $src = shift;
+				return ($src !~ m!^pg_tblspc/[0-9]+$!);
+			});
+	}
+
+	# Copy tablespaces if any
+	my %tsps = $root_node->backup_tablespaces($backup_name);
+	if (%tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() rejects to copy into existing directory.
+		# Copy individual directories in the storage.
+		foreach my $oid (keys %tsps)
+		{
+			my $tsp = $tsps{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			PostgreSQL::Test::RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			PostgreSQL::Test::Utils::dir_symlink($tspdst, $linkdst);
+		}
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index f29d43f1f3..c3b5b4af34 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -725,6 +725,49 @@ sub dir_symlink
 
 =pod
 
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name = perl2host($name);
+		$name .= '/..';
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
+=pod
+
 =back
 
 =head1 Test::More-LIKE METHODS
-- 
2.27.0

v14-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From 69ca7d3657d9ea20d00a6486b0102899a4739a08 Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Thu, 11 Nov 2021 20:46:17 +0900
Subject: [PATCH v14 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightly aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 107 +++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index d7806e6671..44254a7257 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -11,7 +11,7 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
 
-plan tests => 3;
+plan tests => 5;
 
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
@@ -62,3 +62,108 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $soruce_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$soruce_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_primary = PostgreSQL::Test::Cluster->new('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgreNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.27.0

v14-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From 948e08998f106da269e792ca67b3aa8a22a258eb Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v14 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism similar to the invalid page hash table, to track
missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 +
 src/backend/access/transam/xlogutils.c | 145 +++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  55 ++++++++++
 src/backend/commands/tablespace.c      |   5 +
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 215 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e073121a7e..badda1deb2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8309,6 +8309,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e0531ed..99abf8b2f4 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab48df..0f483edb71 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2143,7 +2143,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2161,6 +2163,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2218,6 +2269,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4b96eec9df..0d5dfe007f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1527,6 +1527,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index eebc91f3a5..3341efc052 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0

#59

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Kyotaro Horiguchi (#58)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Just a complaint..

At Fri, 12 Nov 2021 16:43:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

"<JUNCTION>" on Japanese (CP-932) environment. I didn't actually
test it on Windows and msys environments ...yet.

Active perl cannot be installed because of (perhaps) a powershell
version issue... Annoying..

https://community.activestate.com/t/please-update-your-powershell-install-scripts/7897

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#60

Julien Rouhaud

rjuju123@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#59)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi,

On Fri, Dec 24, 2021 at 07:21:59PM +0900, Kyotaro Horiguchi wrote:

Just a complaint..

At Fri, 12 Nov 2021 16:43:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

"<JUNCTION>" on Japanese (CP-932) environment. I didn't actually
test it on Windows and msys environments ...yet.

Active perl cannot be installed because of (perhaps) a powershell
version issue... Annoying..

https://community.activestate.com/t/please-update-your-powershell-install-scripts/7897

I'm not very familiar with Windows, but maybe using Strawberry Perl instead
([1]) would fix your problem? I think it's also quite popular and is commonly
used to run pgBadger on Windows.

Other than that, I see that the TAP tests are failing in all environments,
due to Perl errors. For instance:

[04:06:00.848] [04:05:54] t/003_promote.pl .....
[04:06:00.848] Dubious, test returned 2 (wstat 512, 0x200)
https://api.cirrus-ci.com/v1/artifact/task/4751213722861568/tap/src/bin/pg_basebackup/tmp_check/log/regress_log_020_pg_receivewal
# Initializing node "standby" from backup "my_backup" of node "primary"
Odd number of elements in hash assignment at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 996.
Use of uninitialized value in list assignment at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 996.
Use of uninitialized value $tsp in concatenation (.) or string at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1008.
Use of uninitialized value $tsp in concatenation (.) or string at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1009.

That's apparently the same problem on every failure reported.

Can you send a fixed patchset? In the meantime I will switch the cf entry to
Waiting on Author.

[1]: https://strawberryperl.com/

#61

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Julien Rouhaud (#60)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Sun, 16 Jan 2022 12:43:03 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in

Hi,

On Fri, Dec 24, 2021 at 07:21:59PM +0900, Kyotaro Horiguchi wrote:

Just a complaint..

At Fri, 12 Nov 2021 16:43:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

"<JUNCTION>" on Japanese (CP-932) environment. I didn't actually
tested it on Windows and msys environment ...yet.

Active perl cannot be installed because of (perhaps) a powershell
version issue... Annoying..

https://community.activestate.com/t/please-update-your-powershell-install-scripts/7897

I'm not very familiar with windows, but maybe using strawberry perl instead
([1]) would fix your problem? I think it's also quite popular and is commonly
used to run pgBadger on Windows.

Thanks! I'll try it later.

Other than that, I see that the TAP tests are failing in all environments,
due to Perl errors. For instance:

[04:06:00.848] [04:05:54] t/003_promote.pl .....
[04:06:00.848] Dubious, test returned 2 (wstat 512, 0x200)
https://api.cirrus-ci.com/v1/artifact/task/4751213722861568/tap/src/bin/pg_basebackup/tmp_check/log/regress_log_020_pg_receivewal
# Initializing node "standby" from backup "my_backup" of node "primary"
Odd number of elements in hash assignment at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 996.
Use of uninitialized value in list assignment at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 996.
Use of uninitialized value $tsp in concatenation (.) or string at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1008.
Use of uninitialized value $tsp in concatenation (.) or string at /tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1009.

That's apparently the same problem on every failure reported.

Can you send a fixed patchset? In the meantime I will switch the cf entry to
Waiting on Author.

I guess that failure came from a recent change that allows in-place
tablespace directories. I'll check it out. Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#62

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#61)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 17 Jan 2022 17:24:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Sun, 16 Jan 2022 12:43:03 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in

I'm not very familiar with windows, but maybe using strawberry perl instead
([1]) would fix your problem? I think it's also quite popular and is commonly
used to run pgBadger on Windows.

Thanks! I'll try it later.

The build is stopped by some unresolvable symbols.

Strawberry Perl is 5.28, which doesn't expose new_ctype, new_collate
and new_numeric according to a past discussion. (ActivePerl is 5.32.)

/messages/by-id/20200501134711.08750c5f@antares.wagner.home

However, applying the patch provided there revealed around 70 other
unresolved symbol errors...

# Hmm. perl on CentOS 8 is 5.26..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#63

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Julien Rouhaud (#60)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Sun, 16 Jan 2022 12:43:03 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in

Other than that, I see that the TAP tests are failing in all environments,
due to Perl errors. For instance:

Perl seems to have changed its behavior for an undef hash.

It is said that "if (%undef_hash)" is false, but actually it is true
and "keys %undef_hash" is 1. In the end I had to make
backup_tablespaces() return a hash reference. The pg_basebackup test
takes a backup in tar mode, which broke the test infrastructure.
Cluster::backup now skips symlink adjustment when the backup contains
"/base.tar".

I gave up testing on Windows in my own environment and used Cirrus CI.

# However, it works for confirming established code. The turnaround
# time of CI is still too long for trial and error on unestablished code.

This version works for Unixen but still doesn't for Windows. I'm
searching for a fix for Windows.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v15-0001-Add-tablespace-support-to-TAP-framework.patch (text/x-patch; charset=us-ascii)

From 5f88a80b9a585ca611ab6424f035330a47b2449f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v15 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source node
has tablespaces.  This commit provides a simple way to create tablespace
directories and allows backup routines to handle tablespaces.
---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   2 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm     | 264 ++++++++++++++++++-
 src/test/perl/PostgreSQL/Test/Utils.pm       |  43 +++
 3 files changed, 306 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f0243f28d4..c139b5e000 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -257,7 +257,7 @@ $node->safe_psql('postgres',
 	"CREATE TABLESPACE tblspc1 LOCATION '$realTsDir';");
 $node->safe_psql('postgres',
 	    "CREATE TABLE test1 (a int) TABLESPACE tblspc1;"
-	  . "INSERT INTO test1 VALUES (1234);");
+				 . "INSERT INTO test1 VALUES (1234);");
 $node->backup('tarbackup2', backup_options => ['-Ft']);
 # empty test1, just so that it's different from the to-be-restored data
 $node->safe_psql('postgres', "TRUNCATE TABLE test1;");
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b7d4c24553..d433ccf610 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([, nocreate])
+
+Directory to store tablespace directories.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create the tablespace location directory for backup_name if it does not exist yet.
+Create the parent tablespace storage that holds all location
+directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash that maps each tablespace OID to the name
+of the tablespace directory that the specified backup has.
+For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -345,6 +474,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	print $fh "Install Path: ", $self->{_install_path} . "\n"
@@ -575,6 +705,43 @@ sub adjust_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Create a tablespace directory with the given name, then return its path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Return the path of the existing tablespace with the name.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -594,9 +761,24 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
 
 	local %ENV = $self->_get_env();
 
+	# Build tablespace mappings.  We first let pg_basebackup copy
+	# tablespaces into temporary tablespace storage with a short name
+	# so that we can work with pathnames that fit the tar format that
+	# pg_basebackup depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = PostgreSQL::Test::Utils::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	PostgreSQL::Test::Utils::system_or_bail(
 		'pg_basebackup', '-D',
@@ -604,7 +786,33 @@ sub backup
 		$self->host,     '-p',
 		$self->port,     '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup
+	# directory, unless the backup is in tar mode.
+	if (%tsps && ! -f "$backup_path/base.tar")
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not necessarily required
+		# in backups but keep them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -666,11 +874,32 @@ sub _backup_fs
 	PostgreSQL::Test::RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -754,7 +983,38 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_path,
+			$data_path,
+			# Skipping tablespace symlinks
+			filterfn => sub {
+				my $src = shift;
+				return ($src !~ m!^pg_tblspc/[0-9]+$!);
+			});
+	}
+
+	# Copy tablespaces if any
+	my $tsps = $root_node->backup_tablespaces($backup_name);
+
+	if ($tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() refuses to copy into an existing directory.
+		# Copy individual directories in the storage.
+		foreach my $oid (keys %{$tsps})
+		{
+			my $tsp = ${$tsps}{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			PostgreSQL::Test::RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			PostgreSQL::Test::Utils::dir_symlink($tspdst, $linkdst);
+		}
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 50be10fb5a..266f1c5aaf 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -724,6 +724,49 @@ sub dir_symlink
 
 =pod
 
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name = perl2host($name);
+		$name .= '/..';
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
+=pod
+
 =back
 
 =head1 Test::More-LIKE METHODS
-- 
2.27.0

v15-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From 0d4b1968c7bed11d47b169e8d4e2929db75c38b8 Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Thu, 11 Nov 2021 20:46:17 +0900
Subject: [PATCH v15 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 107 +++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..421cf52dfe 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -11,7 +11,7 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
 
-plan tests => 3;
+plan tests => 5;
 
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
@@ -62,3 +62,108 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $source_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$source_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_primary = PostgreSQL::Test::Cluster->new('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgreSQL::Test::Cluster tries to stop
+	# the standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.27.0

v15-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From d82536792a6544fa082d4cde021e87f44854a2eb Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v15 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism similar to the invalid page hash table to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 +
 src/backend/access/transam/xlogutils.c | 145 +++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  55 ++++++++++
 src/backend/commands/tablespace.c      |   5 +
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 215 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d4cbf3ff..ec279c6158 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8314,6 +8314,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 90e1c48390..cd00e0f01e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 509d1a3e92..02b080e4ef 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2143,7 +2143,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2161,6 +2163,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2218,6 +2269,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index b2ccf5e06e..b2975a0bd2 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1565,6 +1565,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0
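
The skip decision that the dbase_redo() hunk above makes can be sketched
on its own, outside the backend.  This is only an illustration under
stated assumptions: strip_last_component() is a hypothetical stand-in for
get_parent_directory(), should_skip_create() is an invented name, and the
real patch additionally records the missing directory and emits a WARNING.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical stand-in for get_parent_directory(): strip the last path
 * component in place.  Only simple paths without trailing slashes are
 * handled.
 */
void
strip_last_component(char *path)
{
	char	   *slash;

	assert(path != NULL);
	slash = strrchr(path, '/');

	if (slash == NULL)
		path[0] = '\0';			/* no parent component at all */
	else if (slash == path)
		path[1] = '\0';			/* parent is the root directory */
	else
		*slash = '\0';
}

/* True only if path exists and is a directory (the same test as the patch). */
bool
directory_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Decide whether replay of a create database record would be skipped:
 * only before consistency is reached, and only if the tablespace
 * directory (the parent of the database directory) is missing.
 */
bool
should_skip_create(const char *dst_path, bool reached_consistency)
{
	char		parent[1024];

	if (reached_consistency)
		return false;			/* let the ordinary error surface */

	strncpy(parent, dst_path, sizeof(parent) - 1);
	parent[sizeof(parent) - 1] = '\0';
	strip_last_component(parent);

	return !directory_exists(parent);
}
```

The point of the reached_consistency argument is that a missing directory
is tolerated only while the standby is still replaying WAL it may have
seen before; after that, the same condition is treated as a real error.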

#64

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#63)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This version works for Unixen but still doesn't for Windows. I'm
searching for a fix for Windows.

And this version works for Windows. Maybe I took the wrong version
to post. dir_readlink manipulated the target file (junction) name in
the wrong way.

CI now likes this version for all platforms.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v15-0001-Add-tablespace-support-to-TAP-framework.patch (text/x-patch; charset=us-ascii)

From 0423d2b9aae0620c07b522632a8074ecd8ffef64 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v15 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle tablespaces.
---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   2 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm     | 264 ++++++++++++++++++-
 src/test/perl/PostgreSQL/Test/Utils.pm       |  42 +++
 3 files changed, 305 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f0243f28d4..c139b5e000 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -257,7 +257,7 @@ $node->safe_psql('postgres',
 	"CREATE TABLESPACE tblspc1 LOCATION '$realTsDir';");
 $node->safe_psql('postgres',
 	    "CREATE TABLE test1 (a int) TABLESPACE tblspc1;"
-	  . "INSERT INTO test1 VALUES (1234);");
+				 . "INSERT INTO test1 VALUES (1234);");
 $node->backup('tarbackup2', backup_options => ['-Ft']);
 # empty test1, just so that it's different from the to-be-restored data
 $node->safe_psql('postgres', "TRUNCATE TABLE test1;");
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b7d4c24553..d433ccf610 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([, nocreate])
+
+Directory to store tablespace directories.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Creates the tablespace location directory for backup_name if it does
+not exist yet.  Creates the parent tablespace storage that holds all
+location directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash from tablespace OID to the name of the
+tablespace directory held by the specified backup.  For example, an
+OID 16384 pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -345,6 +474,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	print $fh "Install Path: ", $self->{_install_path} . "\n"
@@ -575,6 +705,43 @@ sub adjust_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Creates a tablespace directory with the given name and returns its path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Returns the path of the existing tablespace with the given name.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -594,9 +761,24 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
 
 	local %ENV = $self->_get_env();
 
+	# Build tablespace mappings.  We first let pg_basebackup copy
+	# tablespaces into temporary tablespace storage with a short name,
+	# so that we work on pathnames that fit the tar format which
+	# pg_basebackup depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = PostgreSQL::Test::Utils::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	PostgreSQL::Test::Utils::system_or_bail(
 		'pg_basebackup', '-D',
@@ -604,7 +786,33 @@ sub backup
 		$self->host,     '-p',
 		$self->port,     '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup
+	# directory, unless the backup is in tar mode.
+	if (%tsps && ! -f "$backup_path/base.tar")
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not necessarily required
+		# in backups but keep them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -666,11 +874,32 @@ sub _backup_fs
 	PostgreSQL::Test::RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -754,7 +983,38 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_path,
+			$data_path,
+			# Skipping tablespace symlinks
+			filterfn => sub {
+				my $src = shift;
+				return ($src !~ m!^pg_tblspc/[0-9]+$!);
+			});
+	}
+
+	# Copy tablespaces if any
+	my $tsps = $root_node->backup_tablespaces($backup_name);
+
+	if ($tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() refuses to copy into an existing directory.
+		# Copy individual directories in the storage.
+		foreach my $oid (keys %{$tsps})
+		{
+			my $tsp = ${$tsps}{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			PostgreSQL::Test::RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			PostgreSQL::Test::Utils::dir_symlink($tspdst, $linkdst);
+		}
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 50be10fb5a..e0e5956e9b 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -724,6 +724,48 @@ sub dir_symlink
 
 =pod
 
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name = perl2host($name);
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
+=pod
+
 =back
 
 =head1 Test::More-LIKE METHODS
-- 
2.27.0

v15-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From e2772ce12fac4b552f26ad5c1c694766d3429170 Mon Sep 17 00:00:00 2001
From: "apraveen@pivotal.io" <apraveen@pivotal.io>
Date: Thu, 11 Nov 2021 20:46:17 +0900
Subject: [PATCH v15 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 107 +++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..421cf52dfe 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -11,7 +11,7 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
 
-plan tests => 3;
+plan tests => 5;
 
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
@@ -62,3 +62,108 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $soruce_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$soruce_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_primary = PostgreSQL::Test::Cluster->new('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids error when PostgreNode module tries to stop the
+	# standby node as part of tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
-- 
2.27.0

v15-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From ad9266ffdf4fdfc850218d6bb558127a2753f05c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v15 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table,
to track missing directories during crash recovery.  If all the
missing directory references are matched with corresponding drop
records by the end of crash recovery, the standby can safely enter
archive recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlog.c      |   6 +
 src/backend/access/transam/xlogutils.c | 145 +++++++++++++++++++++++++
 src/backend/commands/dbcommands.c      |  55 ++++++++++
 src/backend/commands/tablespace.c      |   5 +
 src/include/access/xlogutils.h         |   4 +
 5 files changed, 215 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d4cbf3ff..ec279c6158 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8314,6 +8314,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 90e1c48390..cd00e0f01e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %u) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %u database %u) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %u)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %u)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 509d1a3e92..02b080e4ef 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2143,7 +2143,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2161,6 +2163,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2218,6 +2269,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index b2ccf5e06e..b2975a0bd2 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1565,6 +1565,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0
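
For readers who want the bookkeeping idea without the backend
infrastructure, here is a minimal standalone sketch of the
report / forget / check cycle that v15-0003 describes.  It is not the
patch's code: a small fixed array replaces the dynahash table (HTAB),
plain unsigned integers replace Oid, and the names report_missing_dir(),
forget_missing_dir() and count_unresolved_dirs() are invented for the
example.  In the patch, a non-empty table at the consistency point leads
to a PANIC; the sketch only returns the count.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_MISSING 64			/* arbitrary capacity for the sketch */
#define INVALID_OID 0u			/* assumption: 0 means "no database" */

typedef struct
{
	unsigned int spc;			/* tablespace OID, must be valid */
	unsigned int db;			/* database OID, or INVALID_OID */
	char		path[256];
	bool		used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Look up an entry by its (tablespace, database) key. */
static MissingDir *
find_entry(unsigned int spc, unsigned int db)
{
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return &missing[i];
	}
	return NULL;
}

/* Remember a missing directory; reporting the same key twice is harmless. */
bool
report_missing_dir(unsigned int spc, unsigned int db, const char *path)
{
	assert(spc != INVALID_OID);

	if (find_entry(spc, db) != NULL)
		return true;			/* already tracked */

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (!missing[i].used)
		{
			missing[i].spc = spc;
			missing[i].db = db;
			strncpy(missing[i].path, path, sizeof(missing[i].path) - 1);
			missing[i].path[sizeof(missing[i].path) - 1] = '\0';
			missing[i].used = true;
			return true;
		}
	}
	return false;				/* table full */
}

/* A matching drop record was replayed: the reference is resolved. */
bool
forget_missing_dir(unsigned int spc, unsigned int db)
{
	MissingDir *entry = find_entry(spc, db);

	if (entry == NULL)
		return false;
	entry->used = false;
	return true;
}

/* Entries still unresolved; nonzero at the consistency point is fatal. */
int
count_unresolved_dirs(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used)
			n++;
	}
	return n;
}
```

A replay loop would call report_missing_dir() when a create database
record hits a missing directory, forget_missing_dir() when the matching
drop database or drop tablespace record is replayed, and would check
count_unresolved_dirs() once, just before declaring consistency.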

#65

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#64)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 20 Jan 2022 17:19:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
CI now likes this version for all platforms.

A recent xlog.c refactoring broke this patch.
I have just rebased it on that change.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v16-0001-Add-tablespace-support-to-TAP-framework.patch (text/x-patch; charset=us-ascii)

From 35958b17c62cd14f81efa26a097c32c273028f77 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v16 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++++++++++++++++++++++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  43 ++++
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index be05845248..15d57b9a71 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([nocreate])
+
+Directory in which tablespace directories are stored.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create tablespace location directory for backup_name if it does not exist yet.
+Create the parent tablespace storage that holds all location
+directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash that maps each tablespace OID in the
+specified backup to the name of its tablespace directory.
+For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -370,6 +499,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	print $fh "Install Path: ", $self->{_install_path} . "\n"
@@ -600,6 +730,43 @@ sub adjust_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Creates a tablespace directory with the given name and returns its path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Returns the path of the existing tablespace with the given name, or undef.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -619,9 +786,24 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
 
 	local %ENV = $self->_get_env();
 
+	# Build tablespace mappings.  We first let pg_basebackup copy
+	# tablespaces into temporary tablespace storage with a short name,
+	# so that the pathnames fit within the tar format that
+	# pg_basebackup depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = PostgreSQL::Test::Utils::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	PostgreSQL::Test::Utils::system_or_bail(
 		'pg_basebackup', '-D',
@@ -629,7 +811,33 @@ sub backup
 		$self->host,     '-p',
 		$self->port,     '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup
+	# directory, unless the backup is in tar mode.
+	if (%tsps && ! -f "$backup_path/base.tar")
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not necessarily required
+		# in backups but keep them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -691,11 +899,32 @@ sub _backup_fs
 	PostgreSQL::Test::RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -779,7 +1008,38 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_path,
+			$data_path,
+			# Skipping tablespace symlinks
+			filterfn => sub {
+				my $src = shift;
+				return ($src !~ m!^pg_tblspc/[0-9]+$!);
+			});
+	}
+
+	# Copy tablespaces if any
+	my $tsps = $root_node->backup_tablespaces($backup_name);
+
+	if ($tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() refuses to copy into an existing directory, so
+		# copy each tablespace directory in the storage individually.
+		foreach my $oid (keys %{$tsps})
+		{
+			my $tsp = ${$tsps}{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			PostgreSQL::Test::RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			PostgreSQL::Test::Utils::dir_symlink($tspdst, $linkdst);
+		}
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 46cd746796..6daac4ebdf 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -711,6 +711,49 @@ sub dir_symlink
 
 =pod
 
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name = perl2host($name);
+		$name .= '/..';
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
+=pod
+
 =back
 
 =head1 Test::More-LIKE METHODS
-- 
2.27.0

v16-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From 1f3d614bf6462876fd90166fcd9b7bf7e30e8c9e Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Thu, 11 Nov 2021 20:46:17 +0900
Subject: [PATCH v16 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightfully aborts
replay during archive recovery if a missing directory is encountered
when replaying a create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 105 ++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..1998a321da 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -61,4 +61,109 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $source_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$source_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+	CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_primary = PostgreSQL::Test::Cluster->new('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids an error when PostgreSQL::Test::Cluster tries to
+	# stop the standby node as part of the tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within 5 seconds');
+}
+
 done_testing();
-- 
2.27.0

v16-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From b353dc4259baf022de2f6ce9a0301bf812d02ef2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v16 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid-page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlogrecovery.c |   6 +
 src/backend/access/transam/xlogutils.c    | 145 ++++++++++++++++++++++
 src/backend/commands/dbcommands.c         |  56 +++++++++
 src/backend/commands/tablespace.c         |   6 +
 src/include/access/xlogutils.h            |   4 +
 5 files changed, 217 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..97fed1e04d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..3f8f7dadac 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %u) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %u database %u) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %u)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %u)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..8994e9da99 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -2382,7 +2383,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2400,6 +2403,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2462,6 +2514,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..62ee0ca978 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -57,6 +57,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1574,6 +1575,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0
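
As a reading aid for the v16-0003 patch above, the bookkeeping it adds can be approximated by the standalone C sketch below. This is not the patch's code. A fixed-size array stands in for the dynahash table, and the names report_missing_dir, forget_missing_dir and count_unresolved are hypothetical stand-ins for XLogReportMissingDir, XLogForgetMissingDir and the scan done by XLogCheckMissingDirs. The table size and path length are arbitrary assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Arbitrary limits for this sketch; the patch uses a growable hash table. */
#define MAX_MISSING 16
#define PATH_LEN 256

typedef struct MissingDir
{
	bool		used;
	Oid			spc;			/* tablespace OID, always valid */
	Oid			db;				/* database OID, InvalidOid for a tablespace dir */
	char		path[PATH_LEN];
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Look up the entry for (spc, db); NULL if it is not tracked. */
static MissingDir *
find_entry(Oid spc, Oid db)
{
	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return &missing[i];
	return NULL;
}

/*
 * Remember a directory that was missing while replaying a create database
 * record.  Returns false only if the sketch's fixed table is full.
 */
bool
report_missing_dir(Oid spc, Oid db, const char *path)
{
	assert(spc != InvalidOid);

	if (find_entry(spc, db) != NULL)
		return true;			/* already tracked; keep the first path */

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (!missing[i].used)
		{
			missing[i].used = true;
			missing[i].spc = spc;
			missing[i].db = db;
			snprintf(missing[i].path, sizeof(missing[i].path), "%s", path);
			return true;
		}
	}
	return false;
}

/*
 * A later drop record explains the missing directory, so stop tracking it.
 * Returns true if an entry was actually removed.
 */
bool
forget_missing_dir(Oid spc, Oid db)
{
	MissingDir *entry = find_entry(spc, db);

	if (entry == NULL)
		return false;
	entry->used = false;
	return true;
}

/* Number of missing directories that no drop record has explained yet. */
int
count_unresolved(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used)
			n++;
	return n;
}
```

In the patch, the report side is called from dbase_redo only while reachedConsistency is false, the forget side is called from the drop database and drop tablespace redo paths, and any entry still present when consistency is reached results in a PANIC. In this sketch a caller would make the same decision by checking whether count_unresolved() returns a non-zero value.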

#66

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#65)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Wed, 02 Mar 2022 16:59:09 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 20 Jan 2022 17:19:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
CI now likes this version for all platforms.

A recent xlog.c refactoring conflicted with this patch set.
I have just rebased it onto that change.

A function added to Utils.pm used perl2host, which was removed
recently.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v17-0001-Add-tablespace-support-to-TAP-framework.patch (text/x-patch; charset=us-ascii)

From bb714659adcde5265974c46b061966e5dfc556be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v17 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++++++++++++++++++++++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  42 ++++
 2 files changed, 304 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index be05845248..15d57b9a71 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([nocreate])
+
+Directory in which tablespace directories are stored.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create tablespace location directory for backup_name if it does not exist yet.
+Create the parent tablespace storage that holds all location
+directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash that maps each tablespace OID in the
+specified backup to the name of its tablespace directory.  For example,
+an OID 16384 pointing to .../__tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -370,6 +499,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	print $fh "Install Path: ", $self->{_install_path} . "\n"
@@ -600,6 +730,43 @@ sub adjust_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Creates a tablespace directory with the given name and returns its path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Returns the path of the existing tablespace directory with the given name.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -619,9 +786,24 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
 
 	local %ENV = $self->_get_env();
 
+	# Build tablespace mappings.  We first let pg_basebackup copy the
+	# tablespaces into temporary tablespace storage with a short name,
+	# so that the pathnames fit within the limits of the tar format
+	# that pg_basebackup depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = PostgreSQL::Test::Utils::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	PostgreSQL::Test::Utils::system_or_bail(
 		'pg_basebackup', '-D',
@@ -629,7 +811,33 @@ sub backup
 		$self->host,     '-p',
 		$self->port,     '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup
+	# directory, unless the backup is in tar mode.
+	if (%tsps && ! -f "$backup_path/base.tar")
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not strictly required for
+		# backups, but it keeps them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -691,11 +899,32 @@ sub _backup_fs
 	PostgreSQL::Test::RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -779,7 +1008,38 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_path,
+			$data_path,
+			# Skipping tablespace symlinks
+			filterfn => sub {
+				my $src = shift;
+				return ($src !~ m!^pg_tblspc/[0-9]+$!);
+			});
+	}
+
+	# Copy tablespaces if any
+	my $tsps = $root_node->backup_tablespaces($backup_name);
+
+	if ($tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() refuses to copy into an existing directory, so
+		# copy the individual directories in the storage instead.
+		foreach my $oid (keys %{$tsps})
+		{
+			my $tsp = ${$tsps}{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			PostgreSQL::Test::RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			PostgreSQL::Test::Utils::dir_symlink($tspdst, $linkdst);
+		}
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 46cd746796..7f440c4662 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -711,6 +711,48 @@ sub dir_symlink
 
 =pod
 
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name .= '/..';
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
+=pod
+
 =back
 
 =head1 Test::More-LIKE METHODS
-- 
2.27.0
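
The tablespaces() and backup_tablespaces() helpers in the patch above walk pg_tblspc, resolve each OID-named symlink, and keep only the last path component of the target. The following is a minimal C sketch of that single step, not code from the patch: the function name tablespace_link_basename and the 1024-byte buffer are assumptions made for illustration, and it covers only the POSIX readlink() branch, not the Windows junction handling in dir_readlink.

```c
#define _POSIX_C_SOURCE 200809L	/* for readlink() and symlink() declarations */

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve the symlink at linkpath and copy the last
 * path component of its target into out.  Returns 0 on success, or -1 if
 * the link cannot be read or the name does not fit into out.
 */
int
tablespace_link_basename(const char *linkpath, char *out, size_t outlen)
{
	char		target[1024];
	ssize_t		len;
	const char *base;

	assert(linkpath != NULL && out != NULL && outlen > 0);

	/* readlink() does not NUL-terminate, so keep one byte in reserve */
	len = readlink(linkpath, target, sizeof(target) - 1);
	if (len < 0)
		return -1;
	target[len] = '\0';

	/* drop trailing slashes so that ".../ts1/" still yields "ts1" */
	while (len > 1 && target[len - 1] == '/')
		target[--len] = '\0';

	base = strrchr(target, '/');
	base = (base != NULL) ? base + 1 : target;

	if (strlen(base) >= outlen)
		return -1;
	strcpy(out, base);
	return 0;
}
```

For a link pg_tblspc/16384 pointing to /tmp/jWAhkT_fs0/ts1, the helper would store "ts1" in out, which matches the $hash{16384} = "ts1" example in the POD above.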

v17-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From 21da10267102dafe44afa2d99d8969a747e8e072 Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Thu, 11 Nov 2021 20:46:17 +0900
Subject: [PATCH v17 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightfully aborts replay
during archive recovery, if a missing directory is encountered when
replaying create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 105 ++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..1998a321da 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -61,4 +61,109 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $source_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$source_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_primary = PostgreSQL::Test::Cluster->new('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids an error when the PostgreSQL::Test::Cluster
+	# module tries to stop the standby node during teardown.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail within the polling timeout');
+}
+
 done_testing();
-- 
2.27.0

v17-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From 79201550374497555bf272a84c810e116412df3b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v17 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid-page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlogrecovery.c |   6 +
 src/backend/access/transam/xlogutils.c    | 145 ++++++++++++++++++++++
 src/backend/commands/dbcommands.c         |  56 +++++++++
 src/backend/commands/tablespace.c         |   6 +
 src/include/access/xlogutils.h            |   4 +
 5 files changed, 217 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..97fed1e04d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..3f8f7dadac 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %u) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %u database %u) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %u)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %u)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..8994e9da99 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -2382,7 +2383,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2400,6 +2403,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace to another
+		 * can leave the source directory missing in the same way.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2462,6 +2514,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..62ee0ca978 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -57,6 +57,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1574,6 +1575,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0
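
The bookkeeping this patch adds (report a missing directory, forget it when the matching drop record is replayed, complain if anything is left at consistency) can be sketched without the backend's hash table. The code below is a simplified stand-in under stated assumptions: it uses a small fixed array instead of HTAB, the names missing_dir_report, missing_dir_forget and missing_dir_count are hypothetical, and it returns a count where XLogCheckMissingDirs would PANIC.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MISSING 16			/* arbitrary capacity for this sketch */
#define INVALID_OID 0u			/* stands in for InvalidOid */

typedef struct
{
	unsigned	spc;			/* tablespace OID, always valid */
	unsigned	db;				/* database OID, or INVALID_OID */
	bool		used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Remember a missing directory; returns false only if the table is full. */
bool
missing_dir_report(unsigned spc, unsigned db)
{
	int			free_slot = -1;

	assert(spc != INVALID_OID);

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return true;		/* already tracked */
		if (!missing[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;

	missing[free_slot] = (MissingDir) {spc, db, true};
	return true;
}

/* Called when a drop record for (spc, db) is replayed. */
void
missing_dir_forget(unsigned spc, unsigned db)
{
	assert(spc != INVALID_OID);

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			missing[i].used = false;
	}
}

/* Number of unresolved entries; the patch PANICs if this is nonzero. */
int
missing_dir_count(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		n += missing[i].used ? 1 : 0;
	return n;
}
```

As in the patch, the key is the (tablespace, database) pair, so forgetting a tablespace-level entry requires passing INVALID_OID as the database. Reporting the same pair twice leaves a single entry, which mirrors the "already exists" branch of XLogReportMissingDir.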

#67

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#66)

3 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Wed, 02 Mar 2022 19:31:24 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

A function added to Utils.pm used perl2host, which was removed
recently.

The same function also contained a line that probably should have been
removed, and which made the Windows build unhappy.

This should make all platforms in the CI happy.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v18-0001-Add-tablespace-support-to-TAP-framework.patch (text/x-patch; charset=us-ascii)

From ee17f0f4400ce484cdba80c84744ae47d68c6fa4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v18 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++++++++++++++++++++++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  42 ++++
 2 files changed, 304 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index be05845248..15d57b9a71 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([nocreate])
+
+Directory to store tablespace directories.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return an empty list if no tablespace is used
+	return if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Creates the tablespace location directory for backup_name if it does
+not exist yet.  If backup_name is not supplied, creates the parent
+tablespace storage that holds all location directories.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash that maps each tablespace OID in the
+specified backup to the name of its tablespace directory.  For example,
+an OID 16384 pointing to .../__tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -370,6 +499,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	print $fh "Install Path: ", $self->{_install_path} . "\n"
@@ -600,6 +730,43 @@ sub adjust_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Creates a tablespace directory with the given name and returns its path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Returns the path of the existing tablespace directory with the given name.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -619,9 +786,24 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
 
 	local %ENV = $self->_get_env();
 
+	# Build tablespace mappings.  We first let pg_basebackup copy the
+	# tablespaces into temporary tablespace storage with a short name,
+	# so that the pathnames fit within the limits of the tar format
+	# that pg_basebackup depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = PostgreSQL::Test::Utils::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	PostgreSQL::Test::Utils::system_or_bail(
 		'pg_basebackup', '-D',
@@ -629,7 +811,33 @@ sub backup
 		$self->host,     '-p',
 		$self->port,     '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup
+	# directory, unless the backup is in tar mode.
+	if (%tsps && ! -f "$backup_path/base.tar")
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not necessarily required
+		# in backups but keep them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -691,11 +899,32 @@ sub _backup_fs
 	PostgreSQL::Test::RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			PostgreSQL::Test::Utils::dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -779,7 +1008,38 @@ sub init_from_backup
 	else
 	{
 		rmdir($data_path);
-		PostgreSQL::Test::RecursiveCopy::copypath($backup_path, $data_path);
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$backup_path,
+			$data_path,
+			# Skipping tablespace symlinks
+			filterfn => sub {
+				my $src = shift;
+				return ($src !~ m!^pg_tblspc/[0-9]+$!);
+			});
+	}
+
+	# Copy tablespaces if any
+	my $tsps = $root_node->backup_tablespaces($backup_name);
+
+	if ($tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() refuses to copy into an existing directory, so
+		# copy the individual directories in the storage instead.
+		foreach my $oid (keys %{$tsps})
+		{
+			my $tsp = ${$tsps}{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			PostgreSQL::Test::RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			PostgreSQL::Test::Utils::dir_symlink($tspdst, $linkdst);
+		}
 	}
 	chmod(0700, $data_path);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 46cd746796..7f440c4662 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -711,6 +711,48 @@ sub dir_symlink
 
 =pod
 
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name .= '/..';
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
+=pod
+
 =back
 
 =head1 Test::More-LIKE METHODS
-- 
2.27.0

v18-0002-Tests-to-replay-create-database-operation-on-sta.patch (text/x-patch; charset=us-ascii)

From 907d295d4e9823b8c51818272b10f0fe518eff8e Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Thu, 11 Nov 2021 20:46:17 +0900
Subject: [PATCH v18 2/3] Tests to replay create database operation on standby

The tests demonstrate that standby fails to replay a create database
WAL record during crash recovery, if one or more of underlying
directories are missing from the file system.  This can happen if a
drop tablespace or drop database WAL record has been replayed in
archive recovery, before a crash.  And then the create database record
happens to be replayed again during crash recovery.  The failures
indicate bugs that need to be fixed.

The first test, TEST 4, performs several DDL operations resulting in a
database directory being removed, along with a few create database
operations.  It expects crash recovery to succeed because for each
missing directory encountered during create database replay, a matching
drop tablespace or drop database WAL record is found later.

The second test, TEST 5, validates that a standby rightly aborts replay
during archive recovery if a missing directory is encountered when
replaying a create database WAL record.

These tests have been proposed and implemented in various ways by
Alexandra Wang, Anastasia Lubennikova, Kyotaro Horiguchi, Paul Guo and me.
---
 src/test/recovery/t/011_crash_recovery.pl | 105 ++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..1998a321da 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -61,4 +61,109 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $source_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$source_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+	CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).
+
+$node_primary = PostgreSQL::Test::Cluster->new('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the WAL
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids an error when the PostgreSQL::Test::Cluster
+	# module tries to stop the standby node during teardown.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail before the poll timeout');
+}
+
 done_testing();
-- 
2.27.0

v18-0003-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From 5d335b2b62d9c38b5fd80895c839efc843a5de6d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 9 Jan 2020 17:54:40 -0300
Subject: [PATCH v18 3/3] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records by
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlogrecovery.c |   6 +
 src/backend/access/transam/xlogutils.c    | 145 ++++++++++++++++++++++
 src/backend/commands/dbcommands.c         |  56 +++++++++
 src/backend/commands/tablespace.c         |   6 +
 src/include/access/xlogutils.h            |   4 +
 5 files changed, 217 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..97fed1e04d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..3f8f7dadac 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..8994e9da99 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -2382,7 +2383,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2400,6 +2403,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2462,6 +2514,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..62ee0ca978 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -57,6 +57,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1574,6 +1575,11 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
-- 
2.27.0

#68

Michael Paquier

michael@paquier.xyz

almost 4 years ago

In reply to: Kyotaro Horiguchi (#67)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Mar 04, 2022 at 09:10:48AM +0900, Kyotaro Horiguchi wrote:

And same function contained a maybe-should-have-been-removed line
which makes Windows build unhappy.

This should make all platforms in the CI happy.

d6d317d has solved the issue of tablespace paths across multiple nodes
with the new GUC called allow_in_place_tablespaces, and it is being
used successfully in the recovery tests as of 027_stream_regress.pl.

Shouldn't we rely on that rather than extending our test Perl modules
further? One tricky part is the emulation of readlink for junction
points on Windows (dir_readlink in your patch), and the root of the
problem is that 0003 cares about the path structure of the
tablespaces, so we have no need, as far as I can see, for any
dependency on following links in the scope of this patch.

This means that you should be able to simplify the patch set, as we
could entirely drop 0001 in favor of enforcing the new dev GUC in the
nodes created in the TAP test of 0002.

Speaking of 0002, perhaps this had better be in its own file rather
than extending 011_crash_recovery.pl further. 0003 looks like a good
idea for checking the consistency of the path structures created
during replay, and it touches the paths I'd expect it to touch, namely
the database and tablespace redo routines.
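
The bookkeeping that 0003 adds can be sketched outside the backend roughly as follows. This is only an illustration under stated assumptions: it swaps the dynahash table (HTAB) used in the patch for a small fixed array, and the names report_missing_dir, forget_missing_dir and count_unresolved_dirs are hypothetical stand-ins for XLogReportMissingDir, XLogForgetMissingDir and XLogCheckMissingDirs.

```c
#include <assert.h>

#define INVALID_OID 0u          /* stand-in for InvalidOid */
#define MAX_MISSING 16          /* arbitrary capacity for this sketch */

/* One tracked directory; db_oid == INVALID_OID means a tablespace directory. */
typedef struct
{
    unsigned spc_oid;
    unsigned db_oid;
    int      used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/*
 * Remember a directory that was absent while replaying CREATE DATABASE.
 * Returns 1 if newly recorded, 0 if already tracked, -1 if the table is full.
 */
static int
report_missing_dir(unsigned spc_oid, unsigned db_oid)
{
    int free_slot = -1;

    assert(spc_oid != INVALID_OID);
    for (int i = 0; i < MAX_MISSING; i++)
    {
        if (missing[i].used)
        {
            if (missing[i].spc_oid == spc_oid && missing[i].db_oid == db_oid)
                return 0;
        }
        else if (free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    missing[free_slot].spc_oid = spc_oid;
    missing[free_slot].db_oid = db_oid;
    missing[free_slot].used = 1;
    return 1;
}

/*
 * A matching drop record was replayed, so the absence is explained.
 * Returns 1 if an entry was removed, 0 if nothing matched.
 */
static int
forget_missing_dir(unsigned spc_oid, unsigned db_oid)
{
    assert(spc_oid != INVALID_OID);
    for (int i = 0; i < MAX_MISSING; i++)
    {
        if (missing[i].used &&
            missing[i].spc_oid == spc_oid && missing[i].db_oid == db_oid)
        {
            missing[i].used = 0;
            return 1;
        }
    }
    return 0;
}

/* Entries still unexplained; the patch PANICs at consistency if any remain. */
static int
count_unresolved_dirs(void)
{
    int n = 0;

    for (int i = 0; i < MAX_MISSING; i++)
        n += missing[i].used;
    return n;
}
```

In this model, replaying a create database record against a missing directory calls report_missing_dir(), replaying the later drop database or drop tablespace record calls forget_missing_dir(), and the consistency check succeeds only when count_unresolved_dirs() returns zero.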

+       if (!reachedConsistency)
+           XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+       XLogFlush(record->EndRecPtr);
I am not sure I understand why this is required.  A comment may be in
order to explain the hows and the whys.
--
Michael

#69

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Michael Paquier (#68)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thanks for looking at this!

At Fri, 4 Mar 2022 13:51:12 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Fri, Mar 04, 2022 at 09:10:48AM +0900, Kyotaro Horiguchi wrote:

And same function contained a maybe-should-have-been-removed line
which makes Windows build unhappy.

This should make all platforms in the CI happy.

d6d317d has solved the issue of tablespace paths across multiple nodes
with the new GUC called allow_in_place_tablespaces, and it is being
used successfully in the recovery tests as of 027_stream_regress.pl.

The feature allows only one tablespace directory, but the test here
uses (I'm not sure it needs to, though) multiple tablespace
directories, so I think the feature doesn't work for the test.

Maybe I'm missing something, but it doesn't use tablespaces. I see
that in 002_tablespace.pl, but that test uses only one tablespace
location.

Shouldn't we rely on that rather than extending more our test perl
modules? One tricky part is the emulation of readlink for junction
points on Windows (dir_readlink in your patch), and the root of the

Yeah, I don't like that as I said before...

problem is that 0003 cares about the path structure of the
tablespaces so we have no need, as far as I can see, for any
dependency with link follow-up in the scope of this patch.

I'm not sure how this relates to 0001, but maybe I'm not following.

This means that you should be able to simplify the patch set, as we
could entirely drop 0001 in favor of enforcing the new dev GUC in the
nodes created in the TAP test of 0002.

Maybe it's possible by breaking the test into ones that need only one
tablespace. I'll give it a try.

Speaking of 0002, perhaps this had better be in its own file rather
than extending more 011_crash_recovery.pl. 0003 looks like a good

Ok, no problem.

idea to check after the consistency of the path structures created
during replay, and it touches paths I'd expect it to touch, as of
database and tbspace redos.
+       if (!reachedConsistency)
+           XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+       XLogFlush(record->EndRecPtr);
Not sure to understand why this is required.  A comment may be in
order to explain the hows and the whys.

Is this about XLogFlush? As I understand it, the call is there to
advance minRecoveryPoint to that LSN. I'll add a comment to that effect.
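
As a rough model of that reasoning: during recovery XLogFlush() does not write WAL but instead advances minRecoveryPoint, and consistency cannot be declared before replay has reached that point. The sketch below is an assumption-laden toy, not backend code; Lsn, ControlState, update_min_recovery_point and reached_consistency are hypothetical names standing in for XLogRecPtr, the control file state, and the real recovery routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

typedef struct
{
    Lsn min_recovery_point;     /* modelled after the pg_control field */
} ControlState;

/* Advance the minimum recovery point; it never moves backwards. */
static void
update_min_recovery_point(ControlState *ctl, Lsn lsn)
{
    assert(ctl != NULL);
    if (lsn > ctl->min_recovery_point)
        ctl->min_recovery_point = lsn;
}

/* Consistency may be declared only once replay has reached that point. */
static int
reached_consistency(const ControlState *ctl, Lsn replayed_upto)
{
    assert(ctl != NULL);
    return replayed_upto >= ctl->min_recovery_point;
}
```

Under this model, pushing the minimum recovery point to the end of the drop tablespace record means that, after a crash, the re-replayed create database record is still handled before consistency is reached, which is the window in which the missing directory is tolerated.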

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#70

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#69)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

So the new framework has been dropped in this version.
The second test is removed as it is irrelevant to this bug.

In this version the patch is a single file that contains the test.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v20-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From 43bb3ba8900edd53a1feb0acb1a72bdc22bb1627 Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Mon, 7 Mar 2022 17:10:07 +0900
Subject: [PATCH v20] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records by
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlogrecovery.c   |   6 +
 src/backend/access/transam/xlogutils.c      | 145 ++++++++++++++++++++
 src/backend/commands/dbcommands.c           |  56 ++++++++
 src/backend/commands/tablespace.c           |  16 +++
 src/include/access/xlogutils.h              |   4 +
 src/test/recovery/t/029_replay_tsp_drops.pl |  62 +++++++++
 6 files changed, 289 insertions(+)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..97fed1e04d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..3f8f7dadac 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									   100,
+									   &ctl,
+									   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+				 path, spcNode);
+		else
+			elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+	}
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+		}
+		else
+		{
+			char *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..8994e9da99 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -2382,7 +2383,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool	    skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2400,6 +2403,55 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * the WAL has already been replayed.  That means we are replaying
+			 * the create database record a second time, as part of crash
+			 * recovery.  In that case, the tablespace directory has already
+			 * been removed and the create database operation cannot be
+			 * replayed.  We should skip the replay but remember the missing
+			 * tablespace directory, to be matched with a drop tablespace
+			 * record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping create database WAL record"),
+						 errdetail("Target tablespace \"%s\" not found. We "
+								   "expect to encounter a WAL record that "
+								   "removes this directory before reaching "
+								   "consistent state.", parent_path)));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * Source directory may be missing.  E.g. the template database used
+		 * for creating this database may have been dropped, due to reasons
+		 * noted above.  Moving a database from one tablespace may also be a
+		 * partner in the crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping create database WAL record"),
+					 errdetail("Source database \"%s\" not found. We expect "
+							   "to encounter a WAL record that removes this "
+							   "directory before reaching consistent state.",
+							   src_path)));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2462,6 +2514,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..66bd28fc74 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -57,6 +57,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1574,6 +1575,21 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		/*
+		 * Before we remove the tablespace directory, update minimum recovery
+		 * point to cover this WAL record. Once the tablespace is removed,
+		 * there's no going back.  This manually enforces the WAL-first rule.
+		 * Doing this before the removal means that if the removal fails for
+		 * some reason, the directory is left alone and needs to be manually
+		 * removed.  Alternatively you could update the minimum recovery point
+		 * after removal, but that would leave a small window where the
+		 * WAL-first rule could be violated.
+		 */
+		XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..de2a92661c
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,62 @@
+# Copyright (c) 2022, PostgreSQL Global Development Group
+
+# Test recovery involving tablespace removal.  If recovery stops
+# after a tablespace is removed, the next recovery should properly
+# ignore the operations within the removed tablespaces.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+#use File::Compare;
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary1');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+$node_primary->psql('postgres',
+qq[
+	SET allow_in_place_tablespaces=on;
+	CREATE TABLESPACE dropme_ts1 LOCATION '';
+	CREATE TABLESPACE dropme_ts2 LOCATION '';
+	CREATE TABLESPACE source_ts  LOCATION '';
+	CREATE TABLESPACE target_ts  LOCATION '';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# Ensure that a missing tablespace directory during create database
+done_testing();
-- 
2.27.0

#71

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#70)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Mar 7, 2022 at 3:39 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

> So the new framework has been dropped in this version.
> The second test is removed as it is irrelevant to this bug.
>
> In this version the patch is a single file that contains the test.

The status of this patch in the CommitFest was set to "Waiting for
Author." Since a new patch has been submitted since that status was
set, I have changed it to "Needs Review." Since this is now in its
15th CommitFest, we really should get it fixed; that's kind of
ridiculous. (I am as much to blame as anyone.) It does seem to be a
legitimate bug.

A few questions about the patch:

1. Why is it OK to just skip the operation without making it up later?

2. Why not instead change the code so that the operation can succeed,
by creating the prerequisite parent directories? Do we not have enough
information for that? I'm not saying that we definitely should do it
that way rather than this way, but I think we do take that approach in
some cases.
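
A minimal sketch of what "creating the prerequisite parent directories" could look like, assuming only POSIX mkdir() and stat(). The name make_dirs_recursive and the 1024-byte buffer are assumptions made for illustration; this is not PostgreSQL's pg_mkdir_p and not code from any patch in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: create "path" and any missing ancestors.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
make_dirs_recursive(const char *path, mode_t mode)
{
	char		buf[1024];		/* arbitrary limit for this sketch */
	size_t		len;
	char	   *p;
	struct stat st;

	assert(path != NULL);

	len = strlen(path);
	if (len == 0)
	{
		errno = EINVAL;
		return -1;
	}
	if (len >= sizeof(buf))
	{
		errno = ENAMETOOLONG;
		return -1;
	}
	memcpy(buf, path, len + 1);

	/* Create each intermediate component; skip a leading slash. */
	for (p = buf + 1; *p != '\0'; p++)
	{
		if (*p != '/')
			continue;
		*p = '\0';
		if (mkdir(buf, mode) != 0 && errno != EEXIST)
			return -1;
		*p = '/';
	}

	/* Create the leaf itself; an existing directory is not an error. */
	if (mkdir(buf, mode) != 0 && errno != EEXIST)
		return -1;

	/* EEXIST could also mean a non-directory is in the way. */
	if (stat(buf, &st) != 0)
		return -1;
	if (!S_ISDIR(st.st_mode))
	{
		errno = ENOTDIR;
		return -1;
	}
	return 0;
}
```

Calling make_dirs_recursive("pg_tblspc/65546/PG_12_201904072", 0700) before the copy would let the replay proceed, at the cost of leaving a plain directory where a symlink used to be.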

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#72

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Robert Haas (#71)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 14 Mar 2022 17:37:40 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

> On Mon, Mar 7, 2022 at 3:39 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>
> > So the new framework has been dropped in this version.
> > The second test is removed as it is irrelevant to this bug.
> >
> > In this version the patch is a single file that contains the test.
>
> The status of this patch in the CommitFest was set to "Waiting for
> Author." Since a new patch has been submitted since that status was
> set, I have changed it to "Needs Review." Since this is now in its
> 15th CommitFest, we really should get it fixed; that's kind of
> ridiculous. (I am as much to blame as anyone.) It does seem to be a
> legitimate bug.
>
> A few questions about the patch:

Thanks for looking at this!

> 1. Why is it OK to just skip the operation without making it up later?

Does "it" mean the removal of directories? It is not okay, but fixing
that is out of scope for this patch in the first place. The patch
leaves the existing code alone; it just has recovery ignore invalid
accesses to objects that are eventually removed.

Maybe I don't understand your question.

> 2. Why not instead change the code so that the operation can succeed,
> by creating the prerequisite parent directories? Do we not have enough
> information for that? I'm not saying that we definitely should do it
> that way rather than this way, but I think we do take that approach in
> some cases.

That was first proposed by Paul Guo [1], then changed at a very early
stage of this thread so that it ignores failed directory creations.
After that, it became aware of recovery consistency by managing an
invalid-access list.

[1]: /messages/by-id/20210327142316.GA32517@alvherre.pgsql

I think there was no strong reason for the current shape but I
personally rather like the remembering-invalid-access way because it
doesn't dirty the data directory and it is consistent with how we
treat missing heap pages.

I tried a slightly tweaked version (attached) of the first version and
confirmed that it works for the current test script. It doesn't check
recovery consistency but otherwise that way also seems fine.
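
As a rough illustration of the remembering-invalid-access approach mentioned above, here is a standalone sketch of the remember / forget / check bookkeeping. It is a simplification under stated assumptions: a fixed-size array stands in for the dynahash table the patch uses, and Oid, InvalidOid, MAX_MISSING and the function names are local stand-ins, not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)
#define MAX_MISSING 64			/* arbitrary limit for this sketch */

typedef struct MissingDir
{
	Oid			spcNode;
	Oid			dbNode;			/* InvalidOid: the tablespace dir itself */
	bool		used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Record a missing directory; returns false only if the table is full. */
static bool
remember_missing_dir(Oid spcNode, Oid dbNode)
{
	int			free_slot = -1;

	assert(spcNode != InvalidOid);

	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used)
		{
			if (missing[i].spcNode == spcNode && missing[i].dbNode == dbNode)
				return true;	/* already remembered */
		}
		else if (free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;

	missing[free_slot] = (MissingDir) {spcNode, dbNode, true};
	return true;
}

/* Called when the matching drop record is replayed. */
static void
forget_missing_dir(Oid spcNode, Oid dbNode)
{
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used &&
			missing[i].spcNode == spcNode && missing[i].dbNode == dbNode)
			missing[i].used = false;
	}
}

/* At the consistency point, anything still recorded is a real problem. */
static int
count_unresolved(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used)
			n++;
	return n;
}
```

In this sketch a nonzero count_unresolved() at the consistency point corresponds to the case where the patch raises PANIC; nothing is created on disk in the meantime.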

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

create_missing_directories_while_recovery.txt (text/plain; charset=us-ascii)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..28aed8d296 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,6 +47,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2382,6 +2383,7 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
@@ -2401,6 +2403,41 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		/*
+		 * It is possible that the tablespace was later dropped, but we are
+		 * re-redoing database create before that. In that case, those
+		 * directories are gone, and we do not create symlink.
+		 */
+		if (stat(dst_path, &st) < 0 && errno == ENOENT)
+		{
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			elog(WARNING, "creating missing directory: %s", parent_path);
+			if (stat(parent_path, &st) != 0 && pg_mkdir_p(parent_path, pg_dir_create_mode) != 0)
+			{
+				ereport(WARNING,
+						(errmsg("can not recursively create directory \"%s\"",
+								parent_path)));
+			}
+		}
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+		{
+			elog(WARNING, "creating missing copy source directory: %s", src_path);
+			if (stat(src_path, &st) != 0 && pg_mkdir_p(src_path, pg_dir_create_mode) != 0)
+			{
+				ereport(WARNING,
+						(errmsg("can not recursively create directory \"%s\"",
+								src_path)));
+			}
+		}
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..675f578dfe 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -155,8 +155,6 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -169,32 +167,8 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",

#73

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Michael Paquier (#68)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Mar-04, Michael Paquier wrote:

> d6d317d has solved the issue of tablespace paths across multiple nodes
> with the new GUC called allow_in_place_tablespaces, and is getting
> successfully used in the recovery tests as of 027_stream_regress.pl.

OK, but that means that the test suite is now not backpatchable. The
implication here is that either we're going to commit the fix without
any tests at all on older branches, or that we're going to fix it only
in branch master. Are you thinking that it's okay to leave this bug
unfixed in older branches? That seems embarrassing.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto,
no me acuerdo." (Augusto Pinochet a una corte de justicia)

#74

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Kyotaro Horiguchi (#70)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

I had a look at this latest version of the patch, and found some things
to tweak. Attached is v21 with three main changes from Kyotaro's v20:

1. the XLogFlush is only done if consistent state has not been reached.
As I understand, it's not needed in normal mode. (In any case, if we do
call XLogFlush in normal mode, what it does is not advance the recovery
point, so the comment would be incorrect.)

2. use %u to print OIDs rather than %d

3. I changed the warning message wording to this:

+           ereport(WARNING,
+                   (errmsg("skipping replay of database creation WAL record"),
+                    errdetail("The source database directory \"%s\" was not found.",
+                              src_path),
+                    errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));

I also renamed the function XLogReportMissingDir to
XLogRememberMissingDir (which matches the "forget" part) and changed the
DEBUG2 messages in that function to DEBUG1 (all the calls in other
functions remain DEBUG2, because ISTM they are not as interesting).
Finally, I made the TAP test search the WARNING line in the log.
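
To illustrate point 2 above: an OID is an unsigned 32-bit value, so a signed conversion misreports large OIDs. The sketch below assumes a local typedef for Oid and hypothetical helper names; the explicit (int) cast makes the signed reinterpretation visible, and its result for out-of-range values is implementation-defined in C.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Format an OID the way the v21 messages do, with %u. */
static void
format_oid_unsigned(char *buf, size_t buflen, Oid oid)
{
	assert(buf != NULL && buflen > 0);
	snprintf(buf, buflen, "%u", oid);
}

/*
 * Format the same value through a signed int, as %d would see it.
 * On common two's-complement platforms, OIDs above INT_MAX come out
 * negative here.
 */
static void
format_oid_signed(char *buf, size_t buflen, Oid oid)
{
	assert(buf != NULL && buflen > 0);
	snprintf(buf, buflen, "%d", (int) oid);
}
```

For an OID such as 3000000000, format_oid_unsigned yields the expected digits, while the signed variant does not; small OIDs like the ones in the test log print identically either way, which is why the problem is easy to miss.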

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
(Carlos Caszeli)

Attachments:

v21-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-diff; charset=utf-8)

From 6a6fc73a93768a44ec026720c115f77c67d5cda2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 21 Mar 2022 12:34:34 +0100
Subject: [PATCH v21] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism similar to the invalid page hash table, to track
missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |   6 +
 src/backend/access/transam/xlogutils.c      | 159 +++++++++++++++++++-
 src/backend/commands/dbcommands.c           |  57 +++++++
 src/backend/commands/tablespace.c           |  17 +++
 src/include/access/xlogutils.h              |   4 +
 src/test/recovery/t/029_replay_tsp_drops.pl |  67 +++++++++
 src/tools/pgindent/typedefs.list            |   2 +
 7 files changed, 311 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..f48d8d51fb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..8c1b8216be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -54,6 +54,164 @@ bool		InRecovery = false;
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
 HotStandbyState standbyState = STANDBY_DISABLED;
 
+
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table below, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid			spcNode;
+	Oid			dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char		path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+
+/*
+ * Keep track of a directory that wasn't found while replaying database
+ * creation records.  These should match up with tablespace removal records
+ * later in the WAL stream; we verify that before reaching consistency.
+ */
+void
+XLogRememberMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool		found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									  100,
+									  &ctl,
+									  HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG1, "missing directory %s (tablespace %u) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG1, "missing directory %s (tablespace %u database %u) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG1, "logged missing dir %s (tablespace %u)",
+				 path, spcNode);
+		else
+			elog(DEBUG1, "logged missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+	}
+}
+
+/*
+ * Remove an entry from the list of directories not found.  This is to be done
+ * when the matching tablespace removal WAL record is found.
+ */
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %u)", spcNode);
+		}
+		else
+		{
+			char	   *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
+
+
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
  * pages that no longer exist, because their relation was later dropped or
@@ -79,7 +237,6 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
-
 /* Report a reference to an invalid page */
 static void
 report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec778..95771b06a2 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -2483,7 +2484,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool		skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2501,6 +2504,56 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * WAL has already been replayed -- in other words, that we are
+			 * replaying the database creation record a second time with no
+			 * intervening checkpoint.  In that case, the tablespace directory
+			 * has already been removed and the create database operation
+			 * cannot be replayed.  Skip the replay itself, but remember the
+			 * fact that the tablespace directory is missing, to be matched
+			 * with the expected tablespace drop record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogRememberMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping replay of database creation WAL record"),
+						 errdetail("The target tablespace \"%s\" directory was not found.",
+								   parent_path),
+						 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * If the source directory is missing, skip the copy and make a note of
+		 * it for later.
+		 *
+		 * One possible reason for this is that the template database used for
+		 * creating this database may have been dropped, as noted above.
+		 * Moving a database from one tablespace may also be a partner in the
+		 * crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogRememberMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping replay of database creation WAL record"),
+					 errdetail("The source database directory \"%s\" was not found.",
+							   src_path),
+					 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2563,6 +2616,10 @@ dbase_redo(XLogReaderState *record)
 				ereport(WARNING,
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
+
+			if (!reachedConsistency)
+				XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
 			pfree(dst_path);
 		}
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..55f40831da 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -57,6 +57,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1574,6 +1575,22 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		/*
+		 * Before we remove the tablespace directory, update minimum recovery
+		 * point to cover this WAL record. Once the tablespace is removed,
+		 * there's no going back.  This manually enforces the WAL-first rule.
+		 * Doing this before the removal means that if the removal fails for
+		 * some reason, the directory is left alone and needs to be manually
+		 * removed.  Alternatively we could update the minimum recovery point
+		 * after removal, but that would leave a small window where the
+		 * WAL-first rule could be violated.
+		 */
+		if (!reachedConsistency)
+			XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..8d48f003b0 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogRememberMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..90a72be489
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2022, PostgreSQL Global Development Group
+
+# Test recovery involving tablespace removal.  If recovery stops
+# after a tablespace is removed, the next recovery should properly
+# ignore the operations within the removed tablespaces.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary1');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+$node_primary->psql('postgres',
+qq[
+	SET allow_in_place_tablespaces=on;
+	CREATE TABLESPACE dropme_ts1 LOCATION '';
+	CREATE TABLESPACE dropme_ts2 LOCATION '';
+	CREATE TABLESPACE source_ts  LOCATION '';
+	CREATE TABLESPACE target_ts  LOCATION '';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start, 1, "standby started successfully");
+
+my $log = PostgreSQL::Test::Utils::slurp_file($node_standby->logfile);
+like(
+	$log,
+	qr[WARNING:  skipping replay of database creation WAL record],
+	"warning message is logged");
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190508..4d58159b18 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3736,6 +3736,8 @@ xl_invalid_page
 xl_invalid_page_key
 xl_invalidations
 xl_logical_message
+xl_missing_dir_key
+xl_missing_dir
 xl_multi_insert_tuple
 xl_multixact_create
 xl_multixact_truncate
-- 
2.30.2
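
The tblspc_redo hunk in the patch above orders its two steps deliberately: it first calls XLogFlush(record->EndRecPtr), which during recovery advances the minimum recovery point, and only then removes the tablespace directory. A minimal sketch of that ordering follows. StandbyState, Lsn, advance_min_recovery_point and redo_drop_tablespace are hypothetical names made up for illustration, not PostgreSQL code, and the remove_ok flag merely simulates whether the directory removal succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */

typedef struct
{
	Lsn		min_recovery_point;		/* recovery must replay at least to here */
	bool	tablespace_dir_exists;
	bool	reached_consistency;
} StandbyState;

/* Stand-in for the effect of XLogFlush() during recovery. */
static void
advance_min_recovery_point(StandbyState *s, Lsn lsn)
{
	if (lsn > s->min_recovery_point)
		s->min_recovery_point = lsn;
}

/*
 * Replay a drop-tablespace record that ends at end_lsn.  Returns true if
 * the directory was removed.
 */
static bool
redo_drop_tablespace(StandbyState *s, Lsn end_lsn, bool remove_ok)
{
	/* Step 1: make the record "unskippable" before touching the disk. */
	if (!s->reached_consistency)
		advance_min_recovery_point(s, end_lsn);

	/* Step 2: the irreversible part.  On failure the directory stays. */
	if (!remove_ok)
		return false;
	s->tablespace_dir_exists = false;
	return true;
}
```

If the removal fails, the directory is left behind for manual cleanup while the minimum recovery point already covers the record. Doing it the other way round would leave a window in which the directory is gone but recovery could still be allowed to stop before the drop record.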

#75

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Robert Haas (#71)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Mar-14, Robert Haas wrote:

2. Why not instead change the code so that the operation can succeed,
by creating the prerequisite parent directories? Do we not have enough
information for that? I'm not saying that we definitely should do it
that way rather than this way, but I think we do take that approach in
some cases.

It seems we can choose freely between these two implementations -- I
mean I don't see any upsides or downsides to either one.

The current one has the advantage that it never makes the datadir
"dirty", to use Kyotaro's term. It verifies that the creation/drop form
a pair. A possible downside is that if there's a bug, we could end up
with a spurious PANIC at the end of recovery, and no way to recover.
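
To make the "pair" idea concrete, here is a minimal standalone sketch of the bookkeeping behind XLogRememberMissingDir, XLogForgetMissingDir and XLogCheckMissingDirs. It is only an illustration under simplifying assumptions: a fixed-size array replaces the dynahash table, and the names Oid, INVALID_OID, MissingDir, remember_missing_dir, forget_missing_dir and count_unresolved are hypothetical stand-ins, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and InvalidOid. */
typedef unsigned int Oid;
#define INVALID_OID 0u
#define MAX_MISSING 16

typedef struct
{
	Oid		spc;		/* tablespace OID, always valid */
	Oid		db;			/* database OID, or INVALID_OID for a tablespace dir */
	bool	used;
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Remember a directory that was absent while replaying a creation record. */
static bool
remember_missing_dir(Oid spc, Oid db)
{
	int		free_slot = -1;

	assert(spc != INVALID_OID);
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return true;		/* already tracked */
		if (!missing[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;			/* table full; this sketch simply gives up */
	missing[free_slot] = (MissingDir) {spc, db, true};
	return true;
}

/* Forget the entry once the matching drop record has been replayed. */
static void
forget_missing_dir(Oid spc, Oid db)
{
	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			missing[i].used = false;
}

/* Count unresolved entries; the patch PANICs if any remain. */
static int
count_unresolved(void)
{
	int		n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used)
			n++;
	return n;
}
```

For example, replaying a creation record that finds tablespace 65546 missing would call remember_missing_dir(65546, INVALID_OID). The later drop-tablespace record calls forget_missing_dir with the same key, and count_unresolved() is back to zero by the time consistency is checked. An entry that is never forgotten is what would trigger the PANIC mentioned above.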

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#76

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Alvaro Herrera (#74)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Mar-21, Alvaro Herrera wrote:

I had a look at this latest version of the patch, and found some things
to tweak. Attached is v21 with three main changes from Kyotaro's v20:

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

This is quite annoying, considering that the bug was reported shortly
before 12 went into beta.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it." (New York Times, about Microsoft PowerPoint)

#77

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#76)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Fri, 25 Mar 2022 13:26:05 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2022-Mar-21, Alvaro Herrera wrote:

I had a look at this latest version of the patch, and found some things
to tweak. Attached is v21 with three main changes from Kyotaro's v20:

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

Thanks for committing.

This is quite annoying, considering that the bug was reported shortly
before 12 went into beta.

Sure. I'm going to look into that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#78

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#77)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Mar 28, 2022 at 2:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Fri, 25 Mar 2022 13:26:05 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

Just a thought: we could consider back-patching
allow_in_place_tablespaces, after a little while, if we're happy with
how that is working out, if it'd be useful for verifying bug fixes in
back branches. It's non-end-user-facing testing infrastructure.

#79

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Thomas Munro (#78)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 28 Mar 2022 14:34:44 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in

On Mon, Mar 28, 2022 at 2:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Fri, 25 Mar 2022 13:26:05 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

Just a thought: we could consider back-patching
allow_in_place_tablespaces, after a little while, if we're happy with
how that is working out, if it'd be useful for verifying bug fixes in
back branches. It's non-end-user-facing testing infrastructure.

I would appreciate it if we did that. The patch is simple, and it now has
a clear use case for back-patching.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#80

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#77)

2 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 28 Mar 2022 10:01:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 25 Mar 2022 13:26:05 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

Thanks for committing.

This is quite annoying, considering that the bug was reported shortly
before 12 went into beta.

Sure. I'm going to look into that.

Attached are a preparatory patch and a tentative (yes, it's just tentative)
test. They were made for 12 but apply to 10-11 with some warnings.

(Hope the attachments are attached as "attachment", not "inline".)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Tentative-test-for-tsp-replay-fix_12.txt (text/plain; charset=us-ascii)

From 3d5b24691517c1aac4b49728abb122c66a4e33be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 28 Mar 2022 16:29:04 +0900
Subject: [PATCH 1/2] Tentative test for tsp replay fix

---
 src/test/perl/PostgresNode.pm             | 342 +++++++++++++++++++++-
 src/test/recovery/t/011_crash_recovery.pl | 108 ++++++-
 2 files changed, 447 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 7b2ec29bb7..88fa08b61d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -104,6 +104,8 @@ use TestLib ();
 use Time::HiRes qw(usleep);
 use Scalar::Util qw(blessed);
 
+my $windows_os = 0;
+
 our @EXPORT = qw(
   get_new_node
   get_free_port
@@ -323,6 +325,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([nocreate])
+
+Directory in which tablespace directories are stored.
+If nocreate is true, returns undef if it has not been created yet.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = TestLib::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return an empty list if no tablespace is used
+	return if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -338,6 +398,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Creates the tablespace location directory for backup_name if it does
+not exist yet.  Creates the parent tablespace storage that holds all
+location directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash that maps each tablespace OID to the name
+of the tablespace directory held by the specified backup.
+For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	# return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->info()
 
 Return a string containing human-readable diagnostic information (paths, etc)
@@ -354,6 +485,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory: " . $self->tablespace_storage . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
 	close $fh or die;
@@ -536,6 +668,43 @@ sub append_conf
 
 =pod
 
+=item $node->new_tablespace(name)
+
+Creates a tablespace directory with the given name and returns its path.
+
+=cut
+
+sub new_tablespace
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+
+	die "tablespace \"$name\" already exists" if (!mkdir($path));
+
+	return $path;
+}
+
+=pod
+
+=item $node->tablespace_dir(name)
+
+Returns the path of the existing tablespace directory with the given name, or undef if it does not exist.
+
+=cut
+
+sub tablespace_dir
+{
+	my ($self, $name) = @_;
+
+	my $path = $self->tablespace_storage . '/' . $name;
+	return undef if (!-d $path);
+
+	return $path;
+}
+
+=pod
+
 =item $node->backup(backup_name)
 
 Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
@@ -555,13 +724,54 @@ sub backup
 	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name        = $self->name;
+	my @tsp_maps;
+
+	# Build tablespace mappings.  We first let pg_basebackup copy the
+	# tablespaces into temporary tablespace storage with a short name,
+	# so that the pathnames fit the tar format that pg_basebackup
+	# depends on.
+	my $map_src_root = $self->tablespace_storage(1);
+	my $backup_tmptsp_root = TestLib::tempdir_short();
+	my %tsps = $self->tablespaces();
+	foreach my $tspname (values %tsps)
+	{
+		my $src = "$map_src_root/$tspname";
+		my $dst = "$backup_tmptsp_root/$tspname";
+		push(@tsp_maps, "--tablespace-mapping=$src=$dst");
+	}
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail(
 		'pg_basebackup', '-D', $backup_path, '-h',
 		$self->host,     '-p', $self->port,  '--checkpoint',
 		'fast',          '--no-sync',
+		@tsp_maps,
 		@{ $params{backup_options} });
+
+	# Move the tablespaces from temporary storage into backup
+	# directory, unless the backup is in tar mode.
+	if (%tsps && ! -f "$backup_path/base.tar")
+	{
+		$self->backup_create_tablespace_storage();
+		RecursiveCopy::copypath(
+			$backup_tmptsp_root,
+			$self->backup_tablespace_storage_path($backup_name));
+		# delete the temporary directory right away
+		rmtree $backup_tmptsp_root;
+
+		# Fix tablespace symlinks.  This is not necessarily required
+		# in backups but keep them consistent.
+		my $linkdst_root = "$backup_path/pg_tblspc";
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			unlink $tspdst;
+			dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	print "# Backup finished\n";
 	return;
 }
@@ -623,11 +833,32 @@ sub _backup_fs
 	RecursiveCopy::copypath(
 		$self->data_dir,
 		$backup_path,
+		# Skipping some files and tablespace symlinks
 		filterfn => sub {
 			my $src = shift;
-			return ($src ne 'log' and $src ne 'postmaster.pid');
+			return ($src ne 'log' and $src ne 'postmaster.pid' and
+					$src !~ m!^pg_tblspc/[0-9]+$!);
 		});
 
+	# Copy tablespaces if any
+	my %tsps = $self->tablespaces();
+	if (%tsps)
+	{
+		$self->backup_create_tablespace_storage();
+		RecursiveCopy::copypath(
+			$self->tablespace_storage,
+			$self->backup_tablespace_storage_path($backup_name));
+
+		my $linkdst_root = $backup_path . '/pg_tblspc';
+		my $linksrc_root = $self->backup_tablespace_storage_path($backup_name);
+		foreach my $oid (keys %tsps)
+		{
+			my $tspdst = "$linkdst_root/$oid";
+			my $tspsrc = "$linksrc_root/" . $tsps{$oid};
+			dir_symlink($tspsrc, $tspdst);
+		}
+	}
+
 	if ($hot)
 	{
 
@@ -645,6 +876,80 @@ sub _backup_fs
 
 
 
+=pod
+
+=item dir_symlink(oldname, newname)
+
+Portably create a symlink for a directory. On Windows this creates a junction
+point. Elsewhere it just calls perl's builtin symlink.
+
+=cut
+
+sub dir_symlink
+{
+	my $oldname = shift;
+	my $newname = shift;
+	if ($windows_os)
+	{
+		$oldname =~ s,/,\\,g;
+		$newname =~ s,/,\\,g;
+		my $cmd = qq{mklink /j "$newname" "$oldname"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+		system($cmd);
+	}
+	else
+	{
+		symlink $oldname, $newname;
+	}
+	die "No $newname" unless -e $newname;
+}
+
+=pod
+
+=item dir_readlink(name)
+
+Portably read a symlink for a directory. On Windows this reads a junction
+point. Elsewhere it just calls perl's builtin readlink.
+
+=cut
+
+sub dir_readlink
+{
+	my $name = shift;
+	if ($windows_os)
+	{
+		$name .= '/..';
+		$name =~ s,/,\\,g;
+		# Split the path into parent directory and link name
+		die "invalid path spec: $name" if ($name !~ m!^(.*)\\([^\\]+)\\?$!);
+		my ($dir, $fname) = ($1, $2);
+		my $cmd = qq{cmd /c "dir /A:L $dir"};
+		if ($Config{osname} eq 'msys')
+		{
+			# need some indirection on msys
+			$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
+		}
+
+		my $result;
+		foreach my $l (split /[\r\n]+/, `$cmd`)
+		{
+			$result = $1 if ($l =~ m/<JUNCTION>\W+$fname \[(.*)\]/)
+		}
+		die "junction $name not found" if (!defined $result);
+
+		$name =~ s,\\,/,g;
+		return $result;
+	}
+	else
+	{
+		return readlink $name;
+	}
+}
+
 =pod
 
 =item $node->init_from_backup(root_node, backup_name)
@@ -689,7 +994,40 @@ sub init_from_backup
 
 	my $data_path = $self->data_dir;
 	rmdir($data_path);
-	RecursiveCopy::copypath($backup_path, $data_path);
+
+	RecursiveCopy::copypath(
+		$backup_path,
+		$data_path,
+		# Skipping tablespace symlinks
+		filterfn => sub {
+			my $src = shift;
+			return ($src !~ m!^pg_tblspc/[0-9]+$!);
+		});
+
+	# Copy tablespaces if any
+	my $tsps = $root_node->backup_tablespaces($backup_name);
+
+	if ($tsps)
+	{
+		my $tsp_src = $root_node->backup_tablespace_storage_path($backup_name);
+		my $tsp_dst = $self->tablespace_storage();
+		my $linksrc_root = $data_path . '/pg_tblspc';
+
+		# copypath() refuses to copy into an existing directory, so
+		# copy the individual directories in the storage one by one.
+		foreach my $oid (keys %{$tsps})
+		{
+			my $tsp = ${$tsps}{$oid};
+			my $tspsrc = "$tsp_src/$tsp";
+			my $tspdst = "$tsp_dst/$tsp";
+			RecursiveCopy::copypath($tspsrc, $tspdst);
+
+			# Create tablespace symlink for this tablespace
+			my $linkdst = "$linksrc_root/$oid";
+			dir_symlink($tspdst, $linkdst);
+		}
+	}
+
 	chmod(0700, $data_path);
 
 	# Base configuration for this node
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 5dc52412ca..30aaf763e5 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 5;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,109 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+my $node_primary = get_new_node('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+my $dropme_ts_primary1 = $node_primary->new_tablespace('dropme_ts1');
+my $dropme_ts_primary2 = $node_primary->new_tablespace('dropme_ts2');
+my $soruce_ts_primary = $node_primary->new_tablespace('source_ts');
+my $target_ts_primary = $node_primary->new_tablespace('target_ts');
+
+$node_primary->psql('postgres',
+qq[
+	CREATE TABLESPACE dropme_ts1 LOCATION '$dropme_ts_primary1';
+	CREATE TABLESPACE dropme_ts2 LOCATION '$dropme_ts_primary2';
+	CREATE TABLESPACE source_ts  LOCATION '$soruce_ts_primary';
+	CREATE TABLESPACE target_ts  LOCATION '$target_ts_primary';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = get_new_node('standby2');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+	'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+						q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+						  CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+						  CREATE DATABASE moveme_db TABLESPACE source_ts;
+						  ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+						  CREATE DATABASE newdb TEMPLATE template_db;
+						  ALTER DATABASE template_db IS_TEMPLATE = false;
+						  DROP DATABASE dropme_db1;
+						  DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+						  DROP TABLESPACE source_ts;
+						  DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+							   $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1, "standby started successfully");
+
+
+# TEST 5
+#
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes a panic if the standby has already reached
+# a consistent state (archive recovery is in progress).
+
+$node_primary = get_new_node('primary3');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+my $ts_primary = $node_primary->new_tablespace('dropme_ts1');
+$node_primary->safe_psql('postgres',
+						 "CREATE TABLESPACE ts1 LOCATION '$ts_primary'");
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 TABLESPACE ts1");
+
+# Take backup
+$backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+$node_standby = get_new_node('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+File::Path::rmtree($node_standby->tablespace_dir('dropme_ts1'));
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the WAL
+if ($node_primary->poll_query_until(
+		'postgres',
+		'SELECT count(*) = 0 FROM pg_stat_replication',
+		't') == 1)
+{
+	pass('standby failed as expected');
+	# We know that the standby has failed.  Setting its pid to
+	# undefined avoids an error when the PostgresNode module tries to stop
+	# the standby node as part of the tear_down sequence.
+	$node_standby->{_pid} = undef;
+}
+else
+{
+	fail('standby did not fail before the polling timeout');
+}
+
-- 
2.27.0

0002-Fix-replay-of-create-database-records-on-standby_12.txt (text/plain; charset=us-ascii)

From bfd70d2ab7aaf5b5791c46d78e6bf087041abb0f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 28 Mar 2022 16:29:33 +0900
Subject: [PATCH 2/2] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the standby
would fail to recover in such a case.  However, the directories could be
legitimately missing.  Consider a sequence of WAL records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism similar to invalid-page tracking, to keep a
tally of missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at the
end of crash recovery, the standby can safely continue following the
primary.

Backpatch from 10 to 12. This fix has already been committed to 13
and later.

A new TAP test file is added to verify the condition.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 src/backend/access/transam/xlog.c      |   6 +
 src/backend/access/transam/xlogutils.c | 159 ++++++++++++++++++++++++-
 src/backend/commands/dbcommands.c      |  55 +++++++++
 src/backend/commands/tablespace.c      |  17 +++
 src/include/access/xlogutils.h         |   4 +
 src/tools/pgindent/typedefs.list       |   2 +
 6 files changed, 242 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7141e5dca8..3d3342b714 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8006,6 +8006,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..11c40b7446 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -31,6 +31,164 @@
 #include "utils/rel.h"
 
 
+
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to the invalid page table below, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid			spcNode;
+	Oid			dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char		path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+
+/*
+ * Keep track of a directory that wasn't found while replaying database
+ * creation records.  These should match up with tablespace removal records
+ * later in the WAL stream; we verify that before reaching consistency.
+ */
+void
+XLogRememberMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool		found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+									  100,
+									  &ctl,
+									  HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG1, "missing directory %s (tablespace %u) already exists: %s",
+				 path, spcNode, entry->path);
+		else
+			elog(DEBUG1, "missing directory %s (tablespace %u database %u) already exists: %s",
+				 path, spcNode, dbNode, entry->path);
+	}
+	else
+	{
+		strlcpy(entry->path, path, sizeof(entry->path));
+		if (dbNode == InvalidOid)
+			elog(DEBUG1, "logged missing dir %s (tablespace %u)",
+				 path, spcNode);
+		else
+			elog(DEBUG1, "logged missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+	}
+}
+
+/*
+ * Remove an entry from the list of directories not found.  This is to be done
+ * when the matching tablespace removal WAL record is found.
+ */
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+	xl_missing_dir_key key;
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	/* Database OID may be invalid but tablespace OID must be valid. */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+		return;
+
+	if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+	{
+		if (dbNode == InvalidOid)
+		{
+			elog(DEBUG2, "forgot missing dir (tablespace %u)", spcNode);
+		}
+		else
+		{
+			char	   *path = GetDatabasePath(dbNode, spcNode);
+
+			elog(DEBUG2, "forgot missing dir %s (tablespace %u database %u)",
+				 path, spcNode, dbNode);
+			pfree(path);
+		}
+	}
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+	HASH_SEQ_STATUS status;
+	xl_missing_dir *hentry;
+	bool		foundone = false;
+
+	if (missing_dir_tab == NULL)
+		return;					/* nothing to do */
+
+	hash_seq_init(&status, missing_dir_tab);
+
+	while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+	{
+		elog(WARNING, "missing directory \"%s\" tablespace %u database %u",
+			 hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+		foundone = true;
+	}
+
+	if (foundone)
+		elog(PANIC, "WAL contains references to missing directories");
+
+	hash_destroy(missing_dir_tab);
+	missing_dir_tab = NULL;
+}
+
+
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
  * pages that no longer exist, because their relation was later dropped or
@@ -56,7 +214,6 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
-
 /* Report a reference to an invalid page */
 static void
 report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 863f89f19d..44512a8a30 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2108,7 +2108,9 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
+		bool		skip = false;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2126,6 +2128,56 @@ dbase_redo(XLogReaderState *record)
 						(errmsg("some useless files may be left behind in old database directory \"%s\"",
 								dst_path)));
 		}
+		else if (!reachedConsistency)
+		{
+			/*
+			 * It is possible that a drop tablespace record appearing later in
+			 * WAL has already been replayed -- in other words, that we are
+			 * replaying the database creation record a second time with no
+			 * intervening checkpoint.  In that case, the tablespace directory
+			 * has already been removed and the create database operation
+			 * cannot be replayed.  Skip the replay itself, but remember the
+			 * fact that the tablespace directory is missing, to be matched
+			 * with the expected tablespace drop record later.
+			 */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogRememberMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				skip = true;
+				ereport(WARNING,
+						(errmsg("skipping replay of database creation WAL record"),
+						 errdetail("The target tablespace \"%s\" directory was not found.",
+								   parent_path),
+						 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+			}
+			pfree(parent_path);
+		}
+
+		/*
+		 * If the source directory is missing, skip the copy and make a note of
+		 * it for later.
+		 *
+		 * One possible reason for this is that the template database used for
+		 * creating this database may have been dropped, as noted above.
+		 * Moving a database from one tablespace may also be a partner in the
+		 * crime.
+		 */
+		if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+			!reachedConsistency)
+		{
+			XLogRememberMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+			skip = true;
+			ereport(WARNING,
+					(errmsg("skipping replay of database creation WAL record"),
+					 errdetail("The source database directory \"%s\" was not found.",
+							   src_path),
+					 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+		}
+
+		if (skip)
+			return;
 
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
@@ -2181,6 +2233,9 @@ dbase_redo(XLogReaderState *record)
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->tablespace_id, xlrec->db_id);
+
 		if (InHotStandby)
 		{
 			/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f060c24599..5b600a98ff 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -58,6 +58,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -1530,6 +1531,22 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		if (!reachedConsistency)
+			XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+		/*
+		 * Before we remove the tablespace directory, update minimum recovery
+		 * point to cover this WAL record. Once the tablespace is removed,
+		 * there's no going back.  This manually enforces the WAL-first rule.
+		 * Doing this before the removal means that if the removal fails for
+		 * some reason, the directory is left alone and needs to be manually
+		 * removed.  Alternatively we could update the minimum recovery point
+		 * after removal, but that would leave a small window where the
+		 * WAL-first rule could be violated.
+		 */
+		if (!reachedConsistency)
+			XLogFlush(record->EndRecPtr);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 4105b59904..a17c204638 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -23,6 +23,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 								 BlockNumber nblocks);
 
+extern void XLogRememberMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index daebb77387..bdf6b25d59 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3404,6 +3404,8 @@ xl_invalid_page
 xl_invalid_page_key
 xl_invalidations
 xl_logical_message
+xl_missing_dir_key
+xl_missing_dir
 xl_multi_insert_tuple
 xl_multixact_create
 xl_multixact_truncate
-- 
2.27.0

#81

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#76)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Mar 25, 2022 at 8:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Mar-21, Alvaro Herrera wrote:

I had a look at this latest version of the patch, and found some things
to tweak. Attached is v21 with three main changes from Kyotaro's v20:

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

This is quite annoying, considering that the bug was reported shortly
before 12 went into beta.

I think that the warnings this patch issues may cause some unnecessary
end-user alarm. It seems to me that they are basically warning about a
situation that is unusual but not scary. Isn't the appropriate level
for that DEBUG1, maybe without the errhint?

--
Robert Haas
EDB: http://www.enterprisedb.com

#82

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#75)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Mar 21, 2022 at 3:02 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

2. Why not instead change the code so that the operation can succeed,
by creating the prerequisite parent directories? Do we not have enough
information for that? I'm not saying that we definitely should do it
that way rather than this way, but I think we do take that approach in
some cases.

It seems we can choose freely between these two implementations -- I
mean I don't see any upsides or downsides to either one.

What got committed here feels inconsistent to me. Suppose we have a
checkpoint, and then a series of operations that touch a tablespace,
and then a drop database and drop tablespace. If the first operation
happens to be CREATE DATABASE, then this patch is going to fix it by
skipping the operation. However, if the first operation happens to be
almost anything else, the way it's going to reference the dropped
tablespace is via a block reference in a WAL record of a wide variety
of types. That's going to result in a call to
XLogReadBufferForRedoExtended() which will call
XLogReadBufferExtended() which will do smgrcreate(smgr, forknum, true)
which will in turn call TablespaceCreateDbspace() to fill in all the
missing directories.

I don't think that's very good. It would be reasonable to decide that
we're never going to create the missing directories and instead just
remember that they were not found so we can do a cross check. It's
also reasonable to just create the directories on the fly. But doing a
mix of those systems doesn't really seem like the right idea -
particularly because it means that the cross-check system is probably
not very effective at finding actual problems in the code.

Am I missing something here?
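
To make the "remember and cross-check" side of this concrete, here is a minimal standalone sketch, assuming a small fixed-size table in place of the dynahash table the patch uses. The names remember_missing_dir, forget_missing_dir and count_unresolved_missing_dirs are hypothetical and only mirror the roles of XLogRememberMissingDir, XLogForgetMissingDir and XLogCheckMissingDirs; this is not the committed code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for PostgreSQL's Oid; 0 plays the role of InvalidOid. */
typedef unsigned int Oid;

#define MAX_MISSING 64

typedef struct MissingDir
{
	Oid			spc;			/* tablespace OID, must be nonzero */
	Oid			db;				/* database OID, or 0 for the tablespace itself */
	bool		used;			/* is this slot occupied? */
} MissingDir;

static MissingDir missing[MAX_MISSING];

/* Remember that replay skipped a record because a directory was absent. */
static bool
remember_missing_dir(Oid spc, Oid db)
{
	int			free_slot = -1;

	assert(spc != 0);
	for (int i = 0; i < MAX_MISSING; i++)
	{
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			return true;		/* already recorded */
		if (!missing[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;			/* table full */
	missing[free_slot] = (MissingDir) {spc, db, true};
	return true;
}

/* Forget an entry once the matching drop record has been replayed. */
static void
forget_missing_dir(Oid spc, Oid db)
{
	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used && missing[i].spc == spc && missing[i].db == db)
			missing[i].used = false;
}

/* Cross-check at consistency: report and count the unexplained entries. */
static int
count_unresolved_missing_dirs(void)
{
	int			n = 0;

	for (int i = 0; i < MAX_MISSING; i++)
		if (missing[i].used)
		{
			fprintf(stderr, "missing directory: tablespace %u database %u\n",
					missing[i].spc, missing[i].db);
			n++;
		}
	return n;
}
```

The cross-check can only catch what passes through remember_missing_dir. A redo path that silently creates the directory instead never registers anything, which is why mixing the two strategies weakens the check.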

--
Robert Haas
EDB: http://www.enterprisedb.com

#83

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Robert Haas (#81)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 28 Mar 2022 10:37:04 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Fri, Mar 25, 2022 at 8:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Mar-21, Alvaro Herrera wrote:

I had a look at this latest version of the patch, and found some things
to tweak. Attached is v21 with three main changes from Kyotaro's v20:

Pushed this, backpatching to 14 and 13. It would have been good to
backpatch further, but there's a (textually trivial) merge conflict
related to commit e6d8069522c8. Because that commit conceptually
touches the same area that this bugfix is about, I'm not sure that
backpatching further without a lot more thought is wise -- particularly
so when there's no way to automate the test in branches older than
master.

This is quite annoying, considering that the bug was reported shortly
before 12 went into beta.

I think that the warnings this patch issues may cause some unnecessary
end-user alarm. It seems to me that they are basically warning about a
situation that is unusual but not scary. Isn't the appropriate level
for that DEBUG1, maybe without the errhint?

log_invalid_page reports missing pages with DEBUG1 before reaching
consistency. And since a missing directory is not an issue as long as
all of those reports are forgotten by the time consistency is reached,
DEBUG1 sounds reasonable. Maybe we should also lower the DEBUG1 messages
in XLogRememberMissingDir to DEBUG2?
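
The level selection described here can be sketched as a small standalone function. This is only an illustration of the rule as I understand it from log_invalid_page (quiet before consistency, PANIC afterwards unless ignore_invalid_pages is on); the enum and the function name missing_object_level are hypothetical stand-ins, not PostgreSQL symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for elog levels, ordered by increasing severity. */
typedef enum
{
	LVL_DEBUG2,
	LVL_DEBUG1,
	LVL_WARNING,
	LVL_PANIC
} LogLevel;

/*
 * Pick the level for reporting a reference to a missing object.
 *
 * Before consistency is reached the situation is expected, so it is only
 * worth a debug message.  After consistency it indicates a real problem:
 * PANIC by default, downgraded to WARNING when ignore_invalid_pages is on.
 */
static LogLevel
missing_object_level(bool reached_consistency, bool ignore_invalid_pages)
{
	if (!reached_consistency)
		return LVL_DEBUG1;
	return ignore_invalid_pages ? LVL_WARNING : LVL_PANIC;
}
```

Lowering the remember-time message to DEBUG2, as suggested above, would only change the value returned on the first branch.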

--
Kyotaro Horiguchi
NTT Open Source Software Center

#84

Michael Paquier

michael@paquier.xyz

almost 4 years ago

In reply to: Thomas Munro (#78)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Mar 28, 2022 at 02:34:44PM +1300, Thomas Munro wrote:

Just a thought: we could consider back-patching
allow_in_place_tablespaces, after a little while, if we're happy with
how that is working out, if it'd be useful for verifying bug fixes in
back branches. It's non-end-user-facing testing infrastructure.

+1 for a backpatch on that.  That would be useful.
--
Michael

#85

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Robert Haas (#82)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 28 Mar 2022 12:17:50 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Mon, Mar 21, 2022 at 3:02 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

2. Why not instead change the code so that the operation can succeed,
by creating the prerequisite parent directories? Do we not have enough
information for that? I'm not saying that we definitely should do it
that way rather than this way, but I think we do take that approach in
some cases.

It seems we can choose freely between these two implementations -- I
mean I don't see any upsides or downsides to either one.

What got committed here feels inconsistent to me. Suppose we have a
checkpoint, and then a series of operations that touch a tablespace,
and then a drop database and drop tablespace. If the first operation
happens to be CREATE DATABASE, then this patch is going to fix it by
skipping the operation. However, if the first operation happens to be
almost anything else, the way it's going to reference the dropped
tablespace is via a block reference in a WAL record of a wide variety
of types. That's going to result in a call to
XLogReadBufferForRedoExtended() which will call
XLogReadBufferExtended() which will do smgrcreate(smgr, forknum, true)
which will in turn call TablespaceCreateDbspace() to fill in all the
missing directories.

Right. I thought that recovery avoided that, but I was wrong. This
behavior creates a bare (non-symlinked) directory within pg_tblspc. The
directory would disappear soon if recovery proceeds to the consistency
point, though.

I don't think that's very good. It would be reasonable to decide that
we're never going to create the missing directories and instead just
remember that they were not found so we can do a cross check. It's
also reasonable to just create the directories on the fly. But doing a
mix of those systems doesn't really seem like the right idea -
particularly because it means that the cross-check system is probably
not very effective at finding actual problems in the code.

Am I missing something here?

No. I agree that mixing them is not good. On the other hand, we are
already doing that in heapam. AFAICS it sometimes avoids creating a
new page and sometimes creates it. But I don't mean to use that fact
either to justify this patch doing so or to argue against it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#86

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Kyotaro Horiguchi (#85)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Mar-29, Kyotaro Horiguchi wrote:

That's going to result in a call to
XLogReadBufferForRedoExtended() which will call
XLogReadBufferExtended() which will do smgrcreate(smgr, forknum, true)
which will in turn call TablespaceCreateDbspace() to fill in all the
missing directories.

Right. I thought that recovery avoided that, but I was wrong. This
behavior creates a bare (non-symlinked) directory within pg_tblspc. The
directory would disappear soon if recovery proceeds to the consistency
point, though.

Hmm, this is not good.

No. I agree that mixing them is not good. On the other hand, we are
already doing that in heapam. AFAICS it sometimes avoids creating a
new page and sometimes creates it. But I don't mean to use that fact
either to justify this patch doing so or to argue against it.

I think we should revert this patch and do it again using the other
approach: create a stub directory during recovery that can be deleted
later.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... ¿Quién es el machito que tendría carnet?" (Mafalda)

#87

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#86)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Tue, Mar 29, 2022 at 7:37 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I think we should revert this patch and do it again using the other
approach: create a stub directory during recovery that can be deleted
later.

I'm fine with that approach, but I'd like to ask that we proceed
expeditiously, because I have another patch that I want to commit that
touches this area. I can commit to helping with whatever we decide to
do here, but I don't want to keep that patch on ice while we figure it
out and then have it miss the release.

--
Robert Haas
EDB: http://www.enterprisedb.com

#88

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Robert Haas (#87)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Mar-29, Robert Haas wrote:

I'm fine with that approach, but I'd like to ask that we proceed
expeditiously, because I have another patch that I want to commit that
touches this area. I can commit to helping with whatever we decide to
do here, but I don't want to keep that patch on ice while we figure it
out and then have it miss the release.

OK, this is a bug that's been open for years. A fix can be committed
after the feature freeze anyway.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#89

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#88)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Tue, Mar 29, 2022 at 9:28 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

OK, this is a bug that's been open for years. A fix can be committed
after the feature freeze anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com

#90

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Robert Haas (#89)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 29 Mar 2022 09:31:42 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Tue, Mar 29, 2022 at 9:28 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

OK, this is a bug that's been open for years. A fix can be committed
after the feature freeze anyway.

+1

By the way, may I ask how we fix this? The existing recovery code
already sometimes generates soon-to-be-deleted files in a real
directory in pg_tblspc, and otherwise skips applying WAL records to
nonexistent heap pages. That is the "mixed" way.

1. stop XLogReadBufferForRedo from creating a file in nonexistent
directories, then remember the failure (I'm not sure how big the
impact is.)

2. unconditionally create all objects required for recovery to proceed,
2.1 and ignore the failures.
2.2 and remember the failures.

3. Any other?

2 needs to create a real directory in pg_tblspc. So 1?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#91

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#90)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Apr 1, 2022 at 12:22 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

By the way, may I ask how we fix this? The existing recovery code
already sometimes generates soon-to-be-deleted files in a real
directory in pg_tblspc, and otherwise skips applying WAL records to
nonexistent heap pages. That is the "mixed" way.

Can you be more specific about where we have each behavior now?

1. stop XLogReadBufferForRedo from creating a file in nonexistent
directories, then remember the failure (I'm not sure how big the
impact is.)

2. unconditionally create all objects required for recovery to proceed,
2.1 and ignore the failures.
2.2 and remember the failures.

3. Any other?

2 needs to create a real directory in pg_tblspc. So 1?

I think we could either do 1 or 2. My intuition is that getting 2
working would be less scary and more likely to be something we would
feel comfortable back-patching, but 1 is probably a better design in
the long term. However, I might be wrong -- that's just a guess.

--
Robert Haas
EDB: http://www.enterprisedb.com

#92

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Robert Haas (#91)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Fri, 1 Apr 2022 14:51:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Fri, Apr 1, 2022 at 12:22 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

By the way, may I ask how we fix this? The existing recovery code
already sometimes generates soon-to-be-deleted files in a real
directory in pg_tblspc, and otherwise skips applying WAL records to
nonexistent heap pages. That is the "mixed" way.

Can you be more specific about where we have each behavior now?

They're done in XLogReadBufferExtended.

The second behavior happens here,
xlogutils.c:

	/* hm, page doesn't exist in file */
	if (mode == RBM_NORMAL)
	{
		log_invalid_page(rnode, forknum, blkno, false);
+		Assert(0);
		return InvalidBuffer;

With the assertion, 015_promotion_pages.pl crashes. This prevents page
creation and the following redo action on the page.

The first behavior is described in the following comment:

	/*
	 * Create the target file if it doesn't already exist. This lets us cope
	 * if the replay sequence contains writes to a relation that is later
	 * deleted. (The original coding of this routine would instead suppress
	 * the writes, but that seems like it risks losing valuable data if the
	 * filesystem loses an inode during a crash. Better to write the data
	 * until we are actually told to delete the file.)
	 */
	smgrcreate(smgr, forknum, true);

Without the smgrcreate call, make check-world fails due to missing
files for FSM and visibility map forks and for init forks, though it
is a bit doubtful whether those cases fall into the category of
"redo access creating nonexistent objects". In a few places, XLOG_FPI
records are used to create the first page of a file, including main and
init forks. But I don't see a case involving the main fork during make
check-world.

# Most of the failures show up as a standby freeze. I was a bit
# annoyed that make check-world doesn't tell which module is
# currently being tested. I had to deduce it from the
# sequence of preceding script names, but if the first TAP script of a
# module freezes, I had to use ps to find the module.

1. stop XLogReadBufferForRedo from creating a file in nonexistent
directories, then remember the failure (I'm not sure how big the
impact is.)

2. unconditionally create all objects required for recovery to proceed,
2.1 and ignore the failures.
2.2 and remember the failures.

3. Any other?

2 needs to create a real directory in pg_tblspc. So 1?

I think we could either do 1 or 2. My intuition is that getting 2
working would be less scary and more likely to be something we would
feel comfortable back-patching, but 1 is probably a better design in
the long term. However, I might be wrong -- that's just a guess.

Thanks. I forgot to mention it in the previous mail (though I mentioned
it somewhere upthread), but if we take 2, there's no way other than
creating a real directory in pg_tblspc during recovery. I don't think
that is neat.

I haven't found how the patch caused creation of a relation file that
is to be removed soon. However, I found that the v19 patch fails, maybe
due to some change in Cluster.pm. It will take a bit more time to check
that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#93

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#92)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

I haven't found how the patch caused creation of a relation file that
is to be removed soon. However, I found that the v19 patch fails, maybe
due to some change in Cluster.pm. It will take a bit more time to check
that.

I was a bit off track; of course the WAL-logged create database
interferes with the patch here. But I haven't found out why it stops
creating the database directory under pg_tblspc.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#94

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#93)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Apr 4, 2022 at 2:25 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

I haven't found how the patch caused creation of a relation file that
is to be removed soon. However, I found that the v19 patch fails, maybe
due to some change in Cluster.pm. It will take a bit more time to check
that.

I was a bit off track; of course the WAL-logged create database
interferes with the patch here. But I haven't found out why it stops
creating the database directory under pg_tblspc.

I did not understand what the exact problem is here, but the database
directory and the version file are created under the default
tablespace of the target database. However, for tablespaces other than
the database's default one, the database directory is created
along with the smgrcreate(), so that we do not create an unnecessary
directory under a tablespace where we do not have any data to be
copied.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#95

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#94)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 4 Apr 2022 21:14:27 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in

On Mon, Apr 4, 2022 at 2:25 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

I haven't found how the patch caused creation of a relation file that
is to be removed soon. However, I found that the v19 patch fails, maybe
due to some change in Cluster.pm. It will take a bit more time to check
that.

I was a bit off track; of course the WAL-logged create database
interferes with the patch here. But I haven't found out why it stops
creating the database directory under pg_tblspc.

I did not understand what the exact problem is here, but the database
directory and the version file are created under the default
tablespace of the target database. However, for tablespaces other than
the database's default one, the database directory is created
along with the smgrcreate(), so that we do not create an unnecessary
directory under a tablespace where we do not have any data to be
copied.

Thanks. Yeah, I suspected something like that, but I didn't find a
difference in the code I suspected to be related; that suspicion was
wrong. I took the wrong steps trying to reveal that state and hit the
wrong error message. With the correct steps, I could see that
Storage/CREATE creates pg_tblspc/<directory>.

So, if we create a missing tablespace directory, we have no way
other than creating it directly in pg_tblspc, which violates the
rule that there shouldn't be a real directory in pg_tblspc (when
allow_in_place_tablespaces is false).

So, I have the following points in my mind for now.

- We create the directory "since we know it is just a tentative state".

- Then, when reaching consistency, check that there is no real
directory in pg_tblspc when allow_in_place_tablespaces is false.

- Leave the log_invalid_page() mechanism alone, as applying a
differential WAL record to a newly created page that should have
existed always results in a corrupt page.

However, while working on it, I found that recovery faces
missing tablespace directories *after* reaching consistency. I'm
examining that further.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#96

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#95)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 05 Apr 2022 11:16:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

So, I have the following points in my mind for now.

- We create the directory "since we know it is just a tentative state".

- Then, when reaching consistency, check that there is no real
directory in pg_tblspc when allow_in_place_tablespaces is false.

- Leave the log_invalid_page() mechanism alone, as applying a
differential WAL record to a newly created page that should have
existed always results in a corrupt page.

However, while working on it, I found that recovery faces
missing tablespace directories *after* reaching consistency. I'm
examining that further.

Okay, it was my thinko. But I faced another obstacle.

This is the first cut of the above.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v22-0001-Fix-replay-of-create-database-records-on-standby.patchtext/x-patch; charset=us-asciiDownload

From c4107b4953d251f7fab06400d2c2ae2e0a505759 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 5 Apr 2022 15:31:45 +0900
Subject: [PATCH v22] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch allows missing tablespaces to be created during recovery
before reaching consistency.  The tablespaces are created as real
directories, which should not exist but will be removed by the time
consistency is reached. CheckRecoveryConsistency is responsible for
making sure they have disappeared.

Similar to the log_invalid_page mechanism, the GUC ignore_invalid_pages
turns the PANIC errors detected by this patch into WARNINGs, which
allows recovery to continue.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 doc/src/sgml/config.sgml                    |   5 +-
 src/backend/access/transam/xlogrecovery.c   |  53 +++++++
 src/backend/commands/dbcommands.c           |  71 +++++++++
 src/backend/commands/tablespace.c           |  28 +---
 src/backend/utils/misc/guc.c                |   8 +-
 src/include/access/xlogutils.h              |   2 +
 src/test/recovery/t/029_replay_tsp_drops.pl | 155 ++++++++++++++++++++
 7 files changed, 290 insertions(+), 32 deletions(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 43e4ade83e..c22229468b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11228,11 +11228,12 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       <listitem>
        <para>
         If set to <literal>off</literal> (the default), detection of
-        WAL records having references to invalid pages during
+        WAL records having references to invalid pages or
+        WAL records resulting in invalid directory operations during
         recovery causes <productname>PostgreSQL</productname> to
         raise a PANIC-level error, aborting the recovery. Setting
         <varname>ignore_invalid_pages</varname> to <literal>on</literal>
-        causes the system to ignore invalid page references in WAL records
+        causes the system to ignore invalid actions caused by such WAL records
         (but still report a warning), and continue the recovery.
         This behavior may <emphasis>cause crashes, data loss,
         propagate or hide corruption, or other serious problems</emphasis>.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 8d2395dae2..0889ba4b47 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1986,6 +1986,51 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Makes sure that ./pg_tblspc directory doesn't contain a real directory.
+ *
+ * This is intended to be called after reaching consistency.
+ * ignore_invalid_pages=on turns the PANIC error into WARNING so that
+ * recovery can continue.
+ *
+ * Note that it is the normal behavior when allow_in_place_tablespaces=on, but
+ * we don't bother handling that case since it is a developer-only setting.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	char *tblspc_path = "./pg_tblspc";
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(tblspc_path);
+	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	{
+		char	path[MAXPGPATH];
+		char   *p;
+		struct stat st;
+
+		/* Skip entries of non-oid names */
+		for (p = de->d_name; *p && isdigit(*p); p++);
+		if (*p)
+			continue;
+
+		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+
+#ifndef WIN32
+		if (lstat(path, &st) < 0)
+			ereport(ERROR, errcode_for_file_access(),
+					errmsg("could not stat file \"%s\": %m", path));
+
+		if (!S_ISLNK(st.st_mode))
+#else
+		if (!pgwin32_is_junction(path))
+#endif
+			elog(ignore_invalid_pages ? WARNING : PANIC,
+				 "real directory found in pg_tblspc directory: %s", de->d_name);
+	}
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2051,6 +2096,14 @@ CheckRecoveryConsistency(void)
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
 						LSN_FORMAT_ARGS(lastReplayedEndRecPtr))));
+
+		/*
+		 * Check that pg_tblspc doesn't contain a real
+		 * directory. Database/CREATE_* records may create a tablespace
+		 * directory that should have been removed by the time consistency is
+		 * reached.
+		 */
+		CheckTablespaceDirectory();
 	}
 
 	/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index df16533901..910101da01 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -47,6 +48,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -62,6 +64,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/pg_locale.h"
 #include "utils/relmapper.h"
 #include "utils/snapmgr.h"
@@ -135,6 +138,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
 										Oid dst_tsid);
+static void maybe_create_directory(char *path);
 
 /*
  * Create a new database using the WAL_LOG strategy.
@@ -3003,6 +3007,43 @@ get_database_name(Oid dbid)
 	return result;
 }
 
+/*
+ * maybe_create_directory()
+ *
+ * During recovery, there's a case where we validly need to recover a missing
+ * tablespace directory so that recovery can continue.  This happens when
+ * recovery wants to create a database but the holding tablespace has been
+ * removed before the server stopped.  Since we expect that the directory will
+ * be gone before reaching recovery consistency, and we have no knowledge about
+ * the tablespace other than its OID here, we create a real directory under
+ * pg_tblspc here instead of restoring the symlink.  ignore_invalid_pages=on
+ * reduces the error level so that recovery can continue.
+ */
+static void
+maybe_create_directory(char *path)
+{
+	struct stat	st;
+
+	Assert(RecoveryInProgress());
+
+	if (stat(path, &st) == 0)
+		return;
+
+	/* XXX: Do we make sure that the path is under pg_tblspc? */
+
+	if (reachedConsistency && !ignore_invalid_pages)
+		ereport(PANIC,
+				errmsg("missing directory \"%s\"", path));
+
+	elog(reachedConsistency ? WARNING : DEBUG1,
+		 "creating missing directory: %s", path);
+
+	if (pg_mkdir_p(path, pg_dir_create_mode) != 0)
+		ereport(PANIC,
+				errmsg("could not create missing directory \"%s\": %m", path));
+}
+
+
 /*
  * DATABASE resource manager's routines
  */
@@ -3039,6 +3080,30 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		if (stat(dst_path, &st) < 0)
+		{
+			char *parent_path;
+
+			if (errno != ENOENT)
+				ereport(FATAL,
+						errmsg("could not stat directory \"%s\": %m",
+							   dst_path));
+
+			/* create the parent directory if needed and valid */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			maybe_create_directory(parent_path);
+		}
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason as above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+			maybe_create_directory(src_path);
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
@@ -3057,9 +3122,15 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
 		char	   *dbpath;
+		char	   *parent_path;
 
 		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		/* create the parent directory if needed and valid */
+		parent_path = pstrdup(dbpath);
+		get_parent_directory(parent_path);
+		maybe_create_directory(parent_path);
+
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..675f578dfe 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -155,8 +155,6 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -169,32 +167,8 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9e8ab1420d..9134a73d3d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -42,6 +42,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
@@ -139,7 +140,6 @@ extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
-extern bool ignore_invalid_pages;
 extern bool synchronize_seqscans;
 
 #ifdef TRACE_SYNCSCAN
@@ -1304,10 +1304,12 @@ static struct config_bool ConfigureNamesBool[] =
 		{"ignore_invalid_pages", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Continues recovery after an invalid pages failure."),
 			gettext_noop("Detection of WAL records having references to "
-						 "invalid pages during recovery causes PostgreSQL to "
+						 "invalid pages or WAL records resulting in invalid "
+						 "directory operations during "
+						 "recovery causes PostgreSQL to "
 						 "raise a PANIC-level error, aborting the recovery. "
 						 "Setting ignore_invalid_pages to true causes "
-						 "the system to ignore invalid page references "
+						 "the system to ignore those inconsistencies "
 						 "in WAL records (but still report a warning), "
 						 "and continue recovery. This behavior may cause "
 						 "crashes, data loss, propagate or hide corruption, "
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..d88661997f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,6 +54,8 @@ typedef enum
 
 extern HotStandbyState standbyState;
 
+extern bool ignore_invalid_pages;
+
 #define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
 
 
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..b401ab8072
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+#
+# Tests relating to PostgreSQL crash recovery and redo
+#
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_tablespace
+{
+	my ($strategy) = @_;
+
+	my $node_primary = PostgreSQL::Test::Cluster->new("primary1_$strategy");
+	$node_primary->init(allows_streaming => 1);
+	$node_primary->start;
+	$node_primary->psql('postgres',
+			qq[
+				SET allow_in_place_tablespaces=on;
+				CREATE TABLESPACE dropme_ts1 LOCATION '';
+				CREATE TABLESPACE dropme_ts2 LOCATION '';
+				CREATE TABLESPACE source_ts  LOCATION '';
+				CREATE TABLESPACE target_ts  LOCATION '';
+				CREATE DATABASE template_db IS_TEMPLATE = true;
+			]);
+	my $backup_name = 'my_backup';
+	$node_primary->backup($backup_name);
+
+	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
+	$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+	$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");	
+	$node_standby->start;
+
+	# Make sure connection is made
+	$node_primary->poll_query_until(
+		'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+	$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+	# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+	# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+	# to be applied to already-removed directories.
+	my $query = q[
+	CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1 STRATEGY=<STRATEGY>;
+	CREATE TABLE t (a int) TABLESPACE dropme_ts2;
+	CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2 STRATEGY=<STRATEGY>;
+	CREATE DATABASE moveme_db TABLESPACE source_ts STRATEGY=<STRATEGY>;
+	ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+	CREATE DATABASE newdb TEMPLATE template_db STRATEGY=<STRATEGY>;
+	ALTER DATABASE template_db IS_TEMPLATE = false;
+	DROP DATABASE dropme_db1;
+	DROP TABLE t;
+	DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+	DROP TABLESPACE source_ts;
+	DROP DATABASE template_db;];
+
+	$query =~ s/<STRATEGY>/$strategy/g;
+	$node_primary->safe_psql('postgres', $query);
+	$node_primary->wait_for_catchup($node_standby, 'replay',
+									$node_primary->lsn('replay'));
+
+	# show "create missing directory" log message
+	$node_standby->safe_psql('postgres',
+							 "ALTER SYSTEM SET log_min_messages TO debug1;");
+	$node_standby->stop('immediate');
+	# Should restart ignoring directory creation error.
+	is($node_standby->start(fail_ok => 1), 1);
+	$node_standby->stop('immediate');
+}	
+
+test_tablespace("FILE_COPY");
+test_tablespace("WAL_LOG");
+
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).  This is
+# effective only for CREATE DATABASE WITH STRATEGY=FILE_COPY.
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+$node_primary->safe_psql('postgres', q[
+						 SET allow_in_place_tablespaces=on;
+						 CREATE TABLESPACE ts1 LOCATION '']);
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 WITH TABLESPACE ts1 STRATEGY=FILE_COPY");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");	
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+my $tspoid = $node_standby->safe_psql('postgres',
+									  "SELECT oid FROM pg_tablespace WHERE spcname = 'ts1';");
+my $tspdir = $node_standby->data_dir . "/pg_tblspc/$tspoid";
+File::Path::rmtree($tspdir);
+
+my $logstart = get_log_size($node_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1 STRATEGY=FILE_COPY;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+# In this test, PANIC turns into WARNING by ignore_invalid_pages.
+# Check the log messages instead of confirming standby failure.
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log(
+				 $node_standby,
+				 "WARNING:  creating missing directory: pg_tblspc/",
+				 $logstart));
+	sleep 1;
+}
+ok($max_attempts > 0, "invalid directory creation is detected");
+
+done_testing();
+
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0
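
For readers skimming the patch above, the error-level decision made by maybe_create_directory() can be summarized as a small pure function. This is only a sketch for illustration: the enum and the function name missing_dir_action are hypothetical and do not appear in the patch, and the real code calls ereport()/elog() and pg_mkdir_p() rather than returning a value.

```c
#include <assert.h>		/* only needed for the usage check below */
#include <stdbool.h>

/* Hypothetical names, used here only to label the four outcomes. */
typedef enum
{
	ACTION_NONE,			/* directory exists; nothing to do */
	ACTION_CREATE_QUIETLY,	/* before consistency: create, log at DEBUG1 */
	ACTION_CREATE_WARN,		/* after consistency, GUC on: create, log WARNING */
	ACTION_PANIC			/* after consistency, GUC off: abort recovery */
} MissingDirAction;

/*
 * Mirrors the order of the checks in the patch: an existing directory
 * short-circuits, then the PANIC case is tested, and everything else
 * falls through to directory creation at one of two log levels.
 */
static MissingDirAction
missing_dir_action(bool dir_exists, bool reached_consistency,
				   bool ignore_invalid_pages)
{
	if (dir_exists)
		return ACTION_NONE;
	if (reached_consistency && !ignore_invalid_pages)
		return ACTION_PANIC;
	return reached_consistency ? ACTION_CREATE_WARN : ACTION_CREATE_QUIETLY;
}
```

For example, missing_dir_action(false, true, false) yields ACTION_PANIC, while the same call with ignore_invalid_pages set yields ACTION_CREATE_WARN, which is the case the last part of the TAP test looks for in the standby log.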

#97

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#96)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 05 Apr 2022 16:38:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

However, while working on it, I found that recovery faces
missing tablespace directories *after* reaching consistency. I'm
examining that further.

Okay, it was my thinko. But I faced another obstacle.

I forgot to delete the second sentence. Please ignore it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#98

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#96)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 05 Apr 2022 16:38:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 05 Apr 2022 11:16:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

So, I have the following points in my mind for now.

- We create the directory "since we know it is just tentative state".

- Then, check that there is no real directory in pg_tblspc when reaching
consistency, if allow_in_place_tablespaces is false.

- Leave the log_invalid_page() mechanism alone, as it always results in
a corrupt page if a differential WAL record is applied to a newly
created page that should already have existed.

However, while working on it, I found that recovery faces
missing tablespace directories *after* reaching consistency. I'm
examining that further.

Okay, it was my thinko.

This is the first cut of the above.

It had an unused variable for Windows.
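
To make the second point above concrete, here is a minimal standalone sketch of the check, assuming a POSIX system. It is not the patch code: is_oid_name and count_real_tablespace_dirs are hypothetical names, it uses plain opendir()/readdir() instead of AllocateDir()/ReadDir(), it returns a count instead of raising PANIC or WARNING, and it leaves out the Windows junction case.

```c
#define _POSIX_C_SOURCE 200809L	/* for lstat() under strict -std modes */

#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* True if name is non-empty and consists only of decimal digits (OID-like). */
static bool
is_oid_name(const char *name)
{
	const char *p;

	assert(name != NULL);
	if (*name == '\0')
		return false;
	for (p = name; *p; p++)
	{
		if (!isdigit((unsigned char) *p))
			return false;
	}
	return true;
}

/*
 * Count OID-named entries under tblspc_path that are not symlinks.
 * Returns -1 if the directory cannot be opened or an entry cannot be
 * lstat'ed.  A nonzero result corresponds to the "real directory found"
 * condition in the patch.
 */
static int
count_real_tablespace_dirs(const char *tblspc_path)
{
	DIR		   *dir;
	struct dirent *de;
	int			found = 0;

	dir = opendir(tblspc_path);
	if (dir == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL)
	{
		char		path[4096];
		struct stat st;

		/* Skip ".", ".." and anything else that is not an OID-like name */
		if (!is_oid_name(de->d_name))
			continue;

		snprintf(path, sizeof(path), "%s/%s", tblspc_path, de->d_name);

		/* lstat, not stat: we want the link itself, not its target */
		if (lstat(path, &st) < 0)
		{
			closedir(dir);
			return -1;
		}
		if (!S_ISLNK(st.st_mode))
			found++;
	}
	closedir(dir);
	return found;
}
```

Calling count_real_tablespace_dirs("./pg_tblspc") from the data directory once recovery has reached consistency would be expected to return 0 on a healthy standby without in-place tablespaces.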

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v23-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-patch; charset=us-ascii)

From 1e7f5e5e10ea504e168d01b3db8be12c2f63b6d6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 5 Apr 2022 15:31:45 +0900
Subject: [PATCH v23] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch allows missing tablespaces to be created during recovery
before reaching consistency.  The tablespaces are created as real
directories that should not exist but will be removed by the time
recovery reaches consistency. CheckRecoveryConsistency is responsible
for making sure they have disappeared.

Similar to the log_invalid_page mechanism, the GUC ignore_invalid_pages
turns PANIC errors detected by this patch into WARNINGs, which allows
recovery to continue.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 doc/src/sgml/config.sgml                    |   5 +-
 src/backend/access/transam/xlogrecovery.c   |  55 +++++++
 src/backend/commands/dbcommands.c           |  71 +++++++++
 src/backend/commands/tablespace.c           |  28 +---
 src/backend/utils/misc/guc.c                |   8 +-
 src/include/access/xlogutils.h              |   2 +
 src/test/recovery/t/029_replay_tsp_drops.pl | 155 ++++++++++++++++++++
 7 files changed, 292 insertions(+), 32 deletions(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 43e4ade83e..c22229468b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11228,11 +11228,12 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       <listitem>
        <para>
         If set to <literal>off</literal> (the default), detection of
-        WAL records having references to invalid pages during
+        WAL records having references to invalid pages or
+        WAL records resulting in invalid directory operations during
         recovery causes <productname>PostgreSQL</productname> to
         raise a PANIC-level error, aborting the recovery. Setting
         <varname>ignore_invalid_pages</varname> to <literal>on</literal>
-        causes the system to ignore invalid page references in WAL records
+        causes the system to ignore invalid actions caused by such WAL records
         (but still report a warning), and continue the recovery.
         This behavior may <emphasis>cause crashes, data loss,
         propagate or hide corruption, or other serious problems</emphasis>.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 8d2395dae2..18dcc452ca 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1986,6 +1986,53 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Makes sure that ./pg_tblspc directory doesn't contain a real directory.
+ *
+ * This is intended to be called after reaching consistency.
+ * ignore_invalid_pages=on turns the PANIC error into WARNING so that
+ * recovery can continue.
+ *
+ * Note that it is the normal behavior when allow_in_place_tablespaces=on, but
+ * we don't bother handling that case since it is a developer-only setting.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	char *tblspc_path = "./pg_tblspc";
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(tblspc_path);
+	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	{
+		char	path[MAXPGPATH];
+		char   *p;
+#ifndef WIN32
+		struct stat st;
+#endif
+
+		/* Skip entries of non-oid names */
+		for (p = de->d_name; *p && isdigit(*p); p++);
+		if (*p)
+			continue;
+
+		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+
+#ifndef WIN32
+		if (lstat(path, &st) < 0)
+			ereport(ERROR, errcode_for_file_access(),
+					errmsg("could not stat file \"%s\": %m", path));
+
+		if (!S_ISLNK(st.st_mode))
+#else
+		if (!pgwin32_is_junction(path))
+#endif
+			elog(ignore_invalid_pages ? WARNING : PANIC,
+				 "real directory found in pg_tblspc directory: %s", de->d_name);
+	}
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2051,6 +2098,14 @@ CheckRecoveryConsistency(void)
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
 						LSN_FORMAT_ARGS(lastReplayedEndRecPtr))));
+
+		/*
+		 * Check that pg_tblspc doesn't contain a real
+		 * directory. Database/CREATE_* records may create a tablespace
+		 * directory that should have been removed by the time consistency is
+		 * reached.
+		 */
+		CheckTablespaceDirectory();
 	}
 
 	/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index df16533901..910101da01 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -47,6 +48,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -62,6 +64,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/pg_locale.h"
 #include "utils/relmapper.h"
 #include "utils/snapmgr.h"
@@ -135,6 +138,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
 										Oid dst_tsid);
+static void maybe_create_directory(char *path);
 
 /*
  * Create a new database using the WAL_LOG strategy.
@@ -3003,6 +3007,43 @@ get_database_name(Oid dbid)
 	return result;
 }
 
+/*
+ * maybe_create_directory()
+ *
+ * During recovery, there's a case where we validly need to recover a missing
+ * tablespace directory so that recovery can continue.  This happens when
+ * recovery wants to create a database but the holding tablespace has been
+ * removed before the server stopped.  Since we expect that the directory will
+ * be gone before reaching recovery consistency, and we have no knowledge about
+ * the tablespace other than its OID here, we create a real directory under
+ * pg_tblspc here instead of restoring the symlink.  ignore_invalid_pages=on
+ * reduces the error level so that recovery can continue.
+ */
+static void
+maybe_create_directory(char *path)
+{
+	struct stat	st;
+
+	Assert(RecoveryInProgress());
+
+	if (stat(path, &st) == 0)
+		return;
+
+	/* XXX: Do we make sure that the path is under pg_tblspc? */
+
+	if (reachedConsistency && !ignore_invalid_pages)
+		ereport(PANIC,
+				errmsg("missing directory \"%s\"", path));
+
+	elog(reachedConsistency ? WARNING : DEBUG1,
+		 "creating missing directory: %s", path);
+
+	if (pg_mkdir_p(path, pg_dir_create_mode) != 0)
+		ereport(PANIC,
+				errmsg("could not create missing directory \"%s\": %m", path));
+}
+
+
 /*
  * DATABASE resource manager's routines
  */
@@ -3039,6 +3080,30 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		if (stat(dst_path, &st) < 0)
+		{
+			char *parent_path;
+
+			if (errno != ENOENT)
+				ereport(FATAL,
+						errmsg("could not stat directory \"%s\": %m",
+							   dst_path));
+
+			/* create the parent directory if needed and valid */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			maybe_create_directory(parent_path);
+		}
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason as above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+			maybe_create_directory(src_path);
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
@@ -3057,9 +3122,15 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
 		char	   *dbpath;
+		char	   *parent_path;
 
 		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		/* create the parent directory if needed and valid */
+		parent_path = pstrdup(dbpath);
+		get_parent_directory(parent_path);
+		maybe_create_directory(parent_path);
+
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..675f578dfe 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -155,8 +155,6 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -169,32 +167,8 @@ TablespaceCreateDbspace(Oid spcNode, Oid dbNode, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9e8ab1420d..9134a73d3d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -42,6 +42,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
@@ -139,7 +140,6 @@ extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
-extern bool ignore_invalid_pages;
 extern bool synchronize_seqscans;
 
 #ifdef TRACE_SYNCSCAN
@@ -1304,10 +1304,12 @@ static struct config_bool ConfigureNamesBool[] =
 		{"ignore_invalid_pages", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Continues recovery after an invalid pages failure."),
 			gettext_noop("Detection of WAL records having references to "
-						 "invalid pages during recovery causes PostgreSQL to "
+						 "invalid pages or WAL records resulting in invalid "
+						 "directory operations during "
+						 "recovery causes PostgreSQL to "
 						 "raise a PANIC-level error, aborting the recovery. "
 						 "Setting ignore_invalid_pages to true causes "
-						 "the system to ignore invalid page references "
+						 "the system to ignore those inconsistencies "
 						 "in WAL records (but still report a warning), "
 						 "and continue recovery. This behavior may cause "
 						 "crashes, data loss, propagate or hide corruption, "
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..d88661997f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,6 +54,8 @@ typedef enum
 
 extern HotStandbyState standbyState;
 
+extern bool ignore_invalid_pages;
+
 #define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
 
 
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..b401ab8072
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+#
+# Tests relating to PostgreSQL crash recovery and redo
+#
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_tablespace
+{
+	my ($strategy) = @_;
+
+	my $node_primary = PostgreSQL::Test::Cluster->new("primary1_$strategy");
+	$node_primary->init(allows_streaming => 1);
+	$node_primary->start;
+	$node_primary->psql('postgres',
+			qq[
+				SET allow_in_place_tablespaces=on;
+				CREATE TABLESPACE dropme_ts1 LOCATION '';
+				CREATE TABLESPACE dropme_ts2 LOCATION '';
+				CREATE TABLESPACE source_ts  LOCATION '';
+				CREATE TABLESPACE target_ts  LOCATION '';
+				CREATE DATABASE template_db IS_TEMPLATE = true;
+			]);
+	my $backup_name = 'my_backup';
+	$node_primary->backup($backup_name);
+
+	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
+	$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+	$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");	
+	$node_standby->start;
+
+	# Make sure connection is made
+	$node_primary->poll_query_until(
+		'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+	$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+	# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+	# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+	# to be applied to already-removed directories.
+	my $query = q[
+	CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1 STRATEGY=<STRATEGY>;
+	CREATE TABLE t (a int) TABLESPACE dropme_ts2;
+	CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2 STRATEGY=<STRATEGY>;
+	CREATE DATABASE moveme_db TABLESPACE source_ts STRATEGY=<STRATEGY>;
+	ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+	CREATE DATABASE newdb TEMPLATE template_db STRATEGY=<STRATEGY>;
+	ALTER DATABASE template_db IS_TEMPLATE = false;
+	DROP DATABASE dropme_db1;
+	DROP TABLE t;
+	DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+	DROP TABLESPACE source_ts;
+	DROP DATABASE template_db;];
+
+	$query =~ s/<STRATEGY>/$strategy/g;
+	$node_primary->safe_psql('postgres', $query);
+	$node_primary->wait_for_catchup($node_standby, 'replay',
+									$node_primary->lsn('replay'));
+
+	# show "create missing directory" log message
+	$node_standby->safe_psql('postgres',
+							 "ALTER SYSTEM SET log_min_messages TO debug1;");
+	$node_standby->stop('immediate');
+	# Should restart ignoring directory creation error.
+	is($node_standby->start(fail_ok => 1), 1);
+	$node_standby->stop('immediate');
+}
+
+test_tablespace("FILE_COPY");
+test_tablespace("WAL_LOG");
+
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).  This is
+# effective only for CREATE DATABASE WITH STRATEGY=FILE_COPY.
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+$node_primary->safe_psql('postgres', q[
+						 SET allow_in_place_tablespaces=on;
+						 CREATE TABLESPACE ts1 LOCATION '']);
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 WITH TABLESPACE ts1 STRATEGY=FILE_COPY");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+my $tspoid = $node_standby->safe_psql('postgres',
+									  "SELECT oid FROM pg_tablespace WHERE spcname = 'ts1';");
+my $tspdir = $node_standby->data_dir . "/pg_tblspc/$tspoid";
+File::Path::rmtree($tspdir);
+
+my $logstart = get_log_size($node_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1 STRATEGY=FILE_COPY;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+# In this test, PANIC turns into WARNING by ignore_invalid_pages.
+# Check the log messages instead of confirming standby failure.
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log(
+				 $node_standby,
+				 "WARNING:  creating missing directory: pg_tblspc/",
+				 $logstart));
+	sleep 1;
+}
+ok($max_attempts > 0, "invalid directory creation is detected");
+
+done_testing();
+
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#99

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Kyotaro Horiguchi (#98)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Not a review, just a preparatory rebase across some trivially
conflicting changes. I also noticed that
src/test/recovery/t/031_recovery_conflict.pl, which was added two days
after v23 was sent, and which uses allow_in_place_tablespaces, bails out
because of the checks introduced by this patch, so I made the check
routine do nothing in that case.

Anyway, here's v24.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)

Attachments:

v24-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-diff; charset=us-ascii)

From de2c77f5e2c31c911674377a51359ba8fe662c96 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 13 Jul 2022 18:14:18 +0200
Subject: [PATCH v24] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch allows missing tablespaces to be created during recovery
before reaching consistency.  The tablespaces are created as real
directories that should not exist and are expected to be removed again
before consistency is reached.  CheckRecoveryConsistency is responsible
for making sure they have disappeared.

Similar to the log_invalid_page mechanism, the GUC ignore_invalid_pages
turns the PANIC errors detected by this patch into WARNINGs, which
allows recovery to continue.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 doc/src/sgml/config.sgml                    |   5 +-
 src/backend/access/transam/xlogrecovery.c   |  59 ++++++++
 src/backend/commands/dbcommands.c           |  71 +++++++++
 src/backend/commands/tablespace.c           |  28 +---
 src/backend/utils/misc/guc.c                |   8 +-
 src/include/access/xlogutils.h              |   2 +
 src/test/recovery/t/029_replay_tsp_drops.pl | 155 ++++++++++++++++++++
 7 files changed, 296 insertions(+), 32 deletions(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 37fd80388c..1e1c8c1cb7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11363,11 +11363,12 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       <listitem>
        <para>
         If set to <literal>off</literal> (the default), detection of
-        WAL records having references to invalid pages during
+        WAL records having references to invalid pages or
+        WAL records resulting in invalid directory operations during
         recovery causes <productname>PostgreSQL</productname> to
         raise a PANIC-level error, aborting the recovery. Setting
         <varname>ignore_invalid_pages</varname> to <literal>on</literal>
-        causes the system to ignore invalid page references in WAL records
+        causes the system to ignore invalid actions caused by such WAL records
         (but still report a warning), and continue the recovery.
         This behavior may <emphasis>cause crashes, data loss,
         propagate or hide corruption, or other serious problems</emphasis>.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..ae81244e06 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2008,6 +2008,57 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Makes sure that ./pg_tblspc directory doesn't contain a real directory.
+ *
+ * This is intended to be called after reaching consistency.
+ * ignore_invalid_pages=on turns the PANIC error into a WARNING so that
+ * recovery can continue.
+ *
+ * This can't be checked in allow_in_place_tablespaces mode, so skip it in
+ * that case.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	char *tblspc_path = "./pg_tblspc";
+	DIR		   *dir;
+	struct dirent *de;
+
+	/* Do not check for this when test tablespaces are in use */
+	if (allow_in_place_tablespaces)
+		return;
+
+	dir = AllocateDir(tblspc_path);
+	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	{
+		char	path[MAXPGPATH];
+		char   *p;
+#ifndef WIN32
+		struct stat st;
+#endif
+
+		/* Skip entries of non-oid names */
+		for (p = de->d_name; *p && isdigit(*p); p++);
+		if (*p)
+			continue;
+
+		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+
+#ifndef WIN32
+		if (lstat(path, &st) < 0)
+			ereport(ERROR, errcode_for_file_access(),
+					errmsg("could not stat file \"%s\": %m", path));
+
+		if (!S_ISLNK(st.st_mode))
+#else
+		if (!pgwin32_is_junction(path))
+#endif
+			elog(ignore_invalid_pages ? WARNING : PANIC,
+				 "real directory found in pg_tblspc directory: %s", de->d_name);
+	}
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2072,6 +2123,14 @@ CheckRecoveryConsistency(void)
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
 						LSN_FORMAT_ARGS(lastReplayedEndRecPtr))));
+
+		/*
+		 * Check that pg_tblspc doesn't contain a real
+		 * directory. Database/CREATE_* records may create a tablespace
+		 * directory that should have been removed by the time consistency
+		 * is reached.
+		 */
+		CheckTablespaceDirectory();
 	}
 
 	/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1901b434c5..0f860030fa 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -47,6 +48,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -62,6 +64,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/pg_locale.h"
 #include "utils/relmapper.h"
 #include "utils/snapmgr.h"
@@ -135,6 +138,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
 										Oid dst_tsid);
+static void maybe_create_directory(char *path);
 
 /*
  * Create a new database using the WAL_LOG strategy.
@@ -3008,6 +3012,43 @@ get_database_name(Oid dbid)
 	return result;
 }
 
+/*
+ * maybe_create_directory()
+ *
+ * During recovery, there's a case where we validly need to recover a missing
+ * tablespace directory so that recovery can continue.  This happens when
+ * recovery wants to create a database but the holding tablespace has been
+ * removed before the server stopped.  Since we expect that the directory will
+ * be gone before reaching recovery consistency, and we have no knowledge about
+ * the tablespace other than its OID here, we create a real directory under
+ * pg_tblspc here instead of restoring the symlink.  ignore_invalid_pages=on
+ * reduces the error level so that recovery can continue.
+ */
+static void
+maybe_create_directory(char *path)
+{
+	struct stat	st;
+
+	Assert(RecoveryInProgress());
+
+	if (stat(path, &st) == 0)
+		return;
+
+	/* XXX: Do we make sure that the path is under pg_tblspc? */
+
+	if (reachedConsistency && !ignore_invalid_pages)
+		ereport(PANIC,
+				errmsg("missing directory \"%s\"", path));
+
+	elog(reachedConsistency ? WARNING : DEBUG1,
+		 "creating missing directory: %s", path);
+
+	if (pg_mkdir_p(path, pg_dir_create_mode) != 0)
+		ereport(PANIC,
+				errmsg("could not create missing directory \"%s\": %m", path));
+}
+
+
 /*
  * DATABASE resource manager's routines
  */
@@ -3044,6 +3085,30 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		if (stat(dst_path, &st) < 0)
+		{
+			char *parent_path;
+
+			if (errno != ENOENT)
+				ereport(FATAL,
+						errmsg("could not stat directory \"%s\": %m",
+							   dst_path));
+
+			/* create the parent directory if needed and valid */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			maybe_create_directory(parent_path);
+		}
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason as above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+			maybe_create_directory(src_path);
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
@@ -3068,9 +3133,15 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
 		char	   *dbpath;
+		char	   *parent_path;
 
 		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		/* create the parent directory if needed and valid */
+		parent_path = pstrdup(dbpath);
+		get_parent_directory(parent_path);
+		maybe_create_directory(parent_path);
+
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index c8bdd9992a..ddb031a83f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -156,8 +156,6 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -170,32 +168,8 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0328029d43..cd6fa84e22 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -43,6 +43,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
@@ -141,7 +142,6 @@ extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
-extern bool ignore_invalid_pages;
 extern bool synchronize_seqscans;
 
 #ifdef TRACE_SYNCSCAN
@@ -1336,10 +1336,12 @@ static struct config_bool ConfigureNamesBool[] =
 		{"ignore_invalid_pages", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Continues recovery after an invalid pages failure."),
 			gettext_noop("Detection of WAL records having references to "
-						 "invalid pages during recovery causes PostgreSQL to "
+						 "invalid pages or WAL records resulting in invalid "
+						 "directory operations during "
+						 "recovery causes PostgreSQL to "
 						 "raise a PANIC-level error, aborting the recovery. "
 						 "Setting ignore_invalid_pages to true causes "
-						 "the system to ignore invalid page references "
+						 "the system to ignore those inconsistencies "
 						 "in WAL records (but still report a warning), "
 						 "and continue recovery. This behavior may cause "
 						 "crashes, data loss, propagate or hide corruption, "
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index ef182977bf..f203fdf539 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,6 +54,8 @@ typedef enum
 
 extern PGDLLIMPORT HotStandbyState standbyState;
 
+extern bool ignore_invalid_pages;
+
 #define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
 
 
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..d537fe0ce5
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+#
+# Tests relating to PostgreSQL crash recovery and redo
+#
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_tablespace
+{
+	my ($strategy) = @_;
+
+	my $node_primary = PostgreSQL::Test::Cluster->new("primary1_$strategy");
+	$node_primary->init(allows_streaming => 1);
+	$node_primary->start;
+	$node_primary->psql('postgres',
+			qq[
+				SET allow_in_place_tablespaces=on;
+				CREATE TABLESPACE dropme_ts1 LOCATION '';
+				CREATE TABLESPACE dropme_ts2 LOCATION '';
+				CREATE TABLESPACE source_ts  LOCATION '';
+				CREATE TABLESPACE target_ts  LOCATION '';
+				CREATE DATABASE template_db IS_TEMPLATE = true;
+			]);
+	my $backup_name = 'my_backup';
+	$node_primary->backup($backup_name);
+
+	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
+	$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+	$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");
+	$node_standby->start;
+
+	# Make sure connection is made
+	$node_primary->poll_query_until(
+		'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+	$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+	# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+	# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+	# to be applied to already-removed directories.
+	my $query = q[
+	CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1 STRATEGY=<STRATEGY>;
+	CREATE TABLE t (a int) TABLESPACE dropme_ts2;
+	CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2 STRATEGY=<STRATEGY>;
+	CREATE DATABASE moveme_db TABLESPACE source_ts STRATEGY=<STRATEGY>;
+	ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+	CREATE DATABASE newdb TEMPLATE template_db STRATEGY=<STRATEGY>;
+	ALTER DATABASE template_db IS_TEMPLATE = false;
+	DROP DATABASE dropme_db1;
+	DROP TABLE t;
+	DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+	DROP TABLESPACE source_ts;
+	DROP DATABASE template_db;];
+
+	$query =~ s/<STRATEGY>/$strategy/g;
+	$node_primary->safe_psql('postgres', $query);
+	$node_primary->wait_for_catchup($node_standby, 'replay',
+									$node_primary->lsn('replay'));
+
+	# show "create missing directory" log message
+	$node_standby->safe_psql('postgres',
+							 "ALTER SYSTEM SET log_min_messages TO debug1;");
+	$node_standby->stop('immediate');
+	# Should restart ignoring directory creation error.
+	is($node_standby->start(fail_ok => 1), 1);
+	$node_standby->stop('immediate');
+}
+
+test_tablespace("FILE_COPY");
+test_tablespace("WAL_LOG");
+
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).  This is
+# effective only for CREATE DATABASE WITH STRATEGY=FILE_COPY.
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+$node_primary->safe_psql('postgres', q[
+						 SET allow_in_place_tablespaces=on;
+						 CREATE TABLESPACE ts1 LOCATION '']);
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 WITH TABLESPACE ts1 STRATEGY=FILE_COPY");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+my $tspoid = $node_standby->safe_psql('postgres',
+									  "SELECT oid FROM pg_tablespace WHERE spcname = 'ts1';");
+my $tspdir = $node_standby->data_dir . "/pg_tblspc/$tspoid";
+File::Path::rmtree($tspdir);
+
+my $logstart = get_log_size($node_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1 STRATEGY=FILE_COPY;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+# In this test, PANIC turns into WARNING by ignore_invalid_pages.
+# Check the log messages instead of confirming standby failure.
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log(
+				 $node_standby,
+				 "WARNING:  creating missing directory: pg_tblspc/",
+				 $logstart));
+	sleep 1;
+}
+ok($max_attempts > 0, "invalid directory creation is detected");
+
+done_testing();
+
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.30.2

#100

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#99)

4 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Here's a couple of fixups. 0001 is the same as before. In 0002 I think
CheckTablespaceDirectory ends up easier to read if we split out the test
for validity of the link. Looking at that again, I think we don't need
to piggyback on ignore_invalid_pages, which is already a stretch, so
let's not -- instead we can use allow_in_place_tablespaces if users need
a workaround. So that's 0003 (this bit needs more than zero docs,
however.)

0004 is straightforward: let's check for bad directories before logging
about consistent state.

After all this, I'm not sure what to think of dbase_redo. At line 3102,
is the directory supposed to exist or not? I'm confused as to what is
the expected state at that point. I rewrote this, but now I think my
rewrite continues to be confusing, so I'll have to think more about it.
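
One way to read what maybe_create_directory in 0001 does when the directory is missing is as a small decision table keyed on reachedConsistency and ignore_invalid_pages. The sketch below restates that reading as standalone C. The enum and function names are hypothetical, and it models only the severity chosen in 0001, not the 0003 change that stops relying on ignore_invalid_pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three outcomes seen in 0001. */
typedef enum
{
    MISSING_DIR_CREATE_DEBUG,   /* before consistency: log at DEBUG1, create */
    MISSING_DIR_CREATE_WARNING, /* after consistency, ignore_invalid_pages=on */
    MISSING_DIR_PANIC           /* after consistency, ignore_invalid_pages=off */
} MissingDirAction;

/*
 * Decide what to do about a missing directory during redo.  Before
 * consistency the directory is assumed to have been legitimately removed
 * by a later record, so it is recreated quietly whatever the GUC says.
 */
static MissingDirAction
missing_dir_action(bool reached_consistency, bool ignore_invalid_pages)
{
    if (!reached_consistency)
        return MISSING_DIR_CREATE_DEBUG;
    return ignore_invalid_pages ? MISSING_DIR_CREATE_WARNING
                                : MISSING_DIR_PANIC;
}
```

Under this reading, a missing directory is acceptable only while recovery has not yet reached consistency; once it has, the same condition is treated as an inconsistency.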

Another aspect is the tests. Robert described a scenario where the
previously committed version of this patch created trouble. Do we have
a test case to cover that problematic case? I think we should strive to
cover it, if possible.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

Attachments:

v25-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-diff; charset=us-ascii)

From bdb691c2a86301e5369b3ae7f735fa5f7c29304d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 13 Jul 2022 18:14:18 +0200
Subject: [PATCH v25 1/4] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch allows missing tablespaces to be created during recovery
before reaching consistency.  The tablespaces are created as real
directories that should not exist and are expected to be removed again
before consistency is reached.  CheckRecoveryConsistency is responsible
for making sure they have disappeared.

Similar to the log_invalid_page mechanism, the GUC ignore_invalid_pages
turns the PANIC errors detected by this patch into WARNINGs, which
allows recovery to continue.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 doc/src/sgml/config.sgml                    |   5 +-
 src/backend/access/transam/xlogrecovery.c   |  59 ++++++++
 src/backend/commands/dbcommands.c           |  71 +++++++++
 src/backend/commands/tablespace.c           |  28 +---
 src/backend/utils/misc/guc.c                |   8 +-
 src/include/access/xlogutils.h              |   2 +
 src/test/recovery/t/029_replay_tsp_drops.pl | 155 ++++++++++++++++++++
 7 files changed, 296 insertions(+), 32 deletions(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 37fd80388c..1e1c8c1cb7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11363,11 +11363,12 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       <listitem>
        <para>
         If set to <literal>off</literal> (the default), detection of
-        WAL records having references to invalid pages during
+        WAL records having references to invalid pages or
+        WAL records resulting in invalid directory operations during
         recovery causes <productname>PostgreSQL</productname> to
         raise a PANIC-level error, aborting the recovery. Setting
         <varname>ignore_invalid_pages</varname> to <literal>on</literal>
-        causes the system to ignore invalid page references in WAL records
+        causes the system to ignore invalid actions caused by such WAL records
         (but still report a warning), and continue the recovery.
         This behavior may <emphasis>cause crashes, data loss,
         propagate or hide corruption, or other serious problems</emphasis>.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..ae81244e06 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2008,6 +2008,57 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Makes sure that ./pg_tblspc directory doesn't contain a real directory.
+ *
+ * This is intended to be called after reaching consistency.
+ * ignore_invalid_pages=on turns the PANIC error into a WARNING so that
+ * recovery can continue.
+ *
+ * This can't be checked in allow_in_place_tablespaces mode, so skip it in
+ * that case.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	char *tblspc_path = "./pg_tblspc";
+	DIR		   *dir;
+	struct dirent *de;
+
+	/* Do not check for this when test tablespaces are in use */
+	if (allow_in_place_tablespaces)
+		return;
+
+	dir = AllocateDir(tblspc_path);
+	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	{
+		char	path[MAXPGPATH];
+		char   *p;
+#ifndef WIN32
+		struct stat st;
+#endif
+
+		/* Skip entries of non-oid names */
+		for (p = de->d_name; *p && isdigit(*p); p++);
+		if (*p)
+			continue;
+
+		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+
+#ifndef WIN32
+		if (lstat(path, &st) < 0)
+			ereport(ERROR, errcode_for_file_access(),
+					errmsg("could not stat file \"%s\": %m", path));
+
+		if (!S_ISLNK(st.st_mode))
+#else
+		if (!pgwin32_is_junction(path))
+#endif
+			elog(ignore_invalid_pages ? WARNING : PANIC,
+				 "real directory found in pg_tblspc directory: %s", de->d_name);
+	}
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2072,6 +2123,14 @@ CheckRecoveryConsistency(void)
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
 						LSN_FORMAT_ARGS(lastReplayedEndRecPtr))));
+
+		/*
+		 * Check that pg_tblspc doesn't contain a real
+		 * directory. Database/CREATE_* records may create a tablespace
+		 * directory that should have been removed by the time consistency
+		 * is reached.
+		 */
+		CheckTablespaceDirectory();
 	}
 
 	/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1901b434c5..0f860030fa 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -47,6 +48,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -62,6 +64,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/pg_locale.h"
 #include "utils/relmapper.h"
 #include "utils/snapmgr.h"
@@ -135,6 +138,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
 										Oid dst_tsid);
+static void maybe_create_directory(char *path);
 
 /*
  * Create a new database using the WAL_LOG strategy.
@@ -3008,6 +3012,43 @@ get_database_name(Oid dbid)
 	return result;
 }
 
+/*
+ * maybe_create_directory()
+ *
+ * During recovery, there's a case where we validly need to recover a missing
+ * tablespace directory so that recovery can continue.  This happens when
+ * recovery wants to create a database but the holding tablespace has been
+ * removed before the server stopped.  Since we expect that the directory will
+ * be gone before reaching recovery consistency, and we have no knowledge about
+ * the tablespace other than its OID here, we create a real directory under
+ * pg_tblspc here instead of restoring the symlink.  ignore_invalid_pages=on
+ * reduces the error level so that recovery can continue.
+ */
+static void
+maybe_create_directory(char *path)
+{
+	struct stat	st;
+
+	Assert(RecoveryInProgress());
+
+	if (stat(path, &st) == 0)
+		return;
+
+	/* XXX: Do we make sure that the path is under pg_tblspc? */
+
+	if (reachedConsistency && !ignore_invalid_pages)
+		ereport(PANIC,
+				errmsg("missing directory \"%s\"", path));
+
+	elog(reachedConsistency ? WARNING : DEBUG1,
+		 "creating missing directory: %s", path);
+
+	if (pg_mkdir_p(path, pg_dir_create_mode) != 0)
+		ereport(PANIC,
+				errmsg("could not create missing directory \"%s\": %m", path));
+}
+
+
 /*
  * DATABASE resource manager's routines
  */
@@ -3044,6 +3085,30 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		if (stat(dst_path, &st) < 0)
+		{
+			char *parent_path;
+
+			if (errno != ENOENT)
+				ereport(FATAL,
+						errmsg("could not stat directory \"%s\": %m",
+							   dst_path));
+
+			/* create the parent directory if needed and valid */
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			maybe_create_directory(parent_path);
+		}
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason as above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+			maybe_create_directory(src_path);
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
@@ -3068,9 +3133,15 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
 		char	   *dbpath;
+		char	   *parent_path;
 
 		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		/* create the parent directory if needed and valid */
+		parent_path = pstrdup(dbpath);
+		get_parent_directory(parent_path);
+		maybe_create_directory(parent_path);
+
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index c8bdd9992a..ddb031a83f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -156,8 +156,6 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -170,32 +168,8 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0328029d43..cd6fa84e22 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -43,6 +43,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
@@ -141,7 +142,6 @@ extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
-extern bool ignore_invalid_pages;
 extern bool synchronize_seqscans;
 
 #ifdef TRACE_SYNCSCAN
@@ -1336,10 +1336,12 @@ static struct config_bool ConfigureNamesBool[] =
 		{"ignore_invalid_pages", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Continues recovery after an invalid pages failure."),
 			gettext_noop("Detection of WAL records having references to "
-						 "invalid pages during recovery causes PostgreSQL to "
+						 "invalid pages or WAL records resulting in invalid "
+						 "directory operations during "
+						 "recovery causes PostgreSQL to "
 						 "raise a PANIC-level error, aborting the recovery. "
 						 "Setting ignore_invalid_pages to true causes "
-						 "the system to ignore invalid page references "
+						 "the system to ignore those inconsistencies "
 						 "in WAL records (but still report a warning), "
 						 "and continue recovery. This behavior may cause "
 						 "crashes, data loss, propagate or hide corruption, "
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index ef182977bf..f203fdf539 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,6 +54,8 @@ typedef enum
 
 extern PGDLLIMPORT HotStandbyState standbyState;
 
+extern bool ignore_invalid_pages;
+
 #define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
 
 
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..d537fe0ce5
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+#
+# Tests relating to PostgreSQL crash recovery and redo
+#
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_tablespace
+{
+	my ($strategy) = @_;
+
+	my $node_primary = PostgreSQL::Test::Cluster->new("primary1_$strategy");
+	$node_primary->init(allows_streaming => 1);
+	$node_primary->start;
+	$node_primary->psql('postgres',
+			qq[
+				SET allow_in_place_tablespaces=on;
+				CREATE TABLESPACE dropme_ts1 LOCATION '';
+				CREATE TABLESPACE dropme_ts2 LOCATION '';
+				CREATE TABLESPACE source_ts  LOCATION '';
+				CREATE TABLESPACE target_ts  LOCATION '';
+				CREATE DATABASE template_db IS_TEMPLATE = true;
+			]);
+	my $backup_name = 'my_backup';
+	$node_primary->backup($backup_name);
+
+	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
+	$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+	$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");
+	$node_standby->start;
+
+	# Make sure connection is made
+	$node_primary->poll_query_until(
+		'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+	$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+	# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+	# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+	# to be applied to already-removed directories.
+	my $query = q[
+	CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1 STRATEGY=<STRATEGY>;
+	CREATE TABLE t (a int) TABLESPACE dropme_ts2;
+	CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2 STRATEGY=<STRATEGY>;
+	CREATE DATABASE moveme_db TABLESPACE source_ts STRATEGY=<STRATEGY>;
+	ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+	CREATE DATABASE newdb TEMPLATE template_db STRATEGY=<STRATEGY>;
+	ALTER DATABASE template_db IS_TEMPLATE = false;
+	DROP DATABASE dropme_db1;
+	DROP TABLE t;
+	DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+	DROP TABLESPACE source_ts;
+	DROP DATABASE template_db;];
+
+	$query =~ s/<STRATEGY>/$strategy/g;
+	$node_primary->safe_psql('postgres', $query);
+	$node_primary->wait_for_catchup($node_standby, 'replay',
+									$node_primary->lsn('replay'));
+
+	# show "create missing directory" log message
+	$node_standby->safe_psql('postgres',
+							 "ALTER SYSTEM SET log_min_messages TO debug1;");
+	$node_standby->stop('immediate');
+	# Should restart ignoring directory creation error.
+	is($node_standby->start(fail_ok => 1), 1);
+	$node_standby->stop('immediate');
+}
+
+test_tablespace("FILE_COPY");
+test_tablespace("WAL_LOG");
+
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).  This is
+# effective only for CREATE DATABASE WITH STRATEGY=FILE_COPY.
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+$node_primary->safe_psql('postgres', q[
+						 SET allow_in_place_tablespaces=on;
+						 CREATE TABLESPACE ts1 LOCATION '']);
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 WITH TABLESPACE ts1 STRATEGY=FILE_COPY");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "ignore_invalid_pages = on");
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+my $tspoid = $node_standby->safe_psql('postgres',
+									  "SELECT oid FROM pg_tablespace WHERE spcname = 'ts1';");
+my $tspdir = $node_standby->data_dir . "/pg_tblspc/$tspoid";
+File::Path::rmtree($tspdir);
+
+my $logstart = get_log_size($node_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1 STRATEGY=FILE_COPY;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+# In this test, PANIC turns into WARNING by ignore_invalid_pages.
+# Check the log messages instead of confirming standby failure.
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log(
+				 $node_standby,
+				 "WARNING:  creating missing directory: pg_tblspc/",
+				 $logstart));
+	sleep 1;
+}
+ok($max_attempts > 0, "invalid directory creation is detected");
+
+done_testing();
+
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.30.2

v25-0002-split-is_path_tslink-as-a-new-routine.patch (text/x-diff; charset=us-ascii)

From 87d05718d69cf26d0f2017dc30e30f0f62dcbbff Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 14 Jul 2022 16:43:03 +0200
Subject: [PATCH v25 2/4] split is_path_tslink as a new routine

---
 src/backend/access/transam/xlogrecovery.c | 68 +++++++++++++----------
 1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae81244e06..e04d30cf3e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2008,54 +2008,66 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Is the given directory entry a symlink/junction point?  Subroutine for
+ * CheckTablespaceDirectory.
+ */
+static bool
+is_path_tslink(const char *path)
+{
+#ifndef WIN32
+		struct stat st;
+
+		if (lstat(path, &st) < 0)
+			ereport(ERROR, errcode_for_file_access(),
+					errmsg("could not stat file \"%s\": %m", path));
+		return S_ISLNK(st.st_mode);
+#else
+		return pgwin32_is_junction(path);
+#endif
+}
+
 /*
  * Makes sure that ./pg_tblspc directory doesn't contain a real directory.
  *
- * This is intended to be called after reaching consistency.
- * ignore_invalid_pages=on turns into the PANIC error into WARNING so that
- * recovery can continue.
+ * Replay of database creation XLOG records for databases that were later
+ * dropped can create fake directories in pg_tblspc.  By the time consistency
+ * is reached these directories should have been removed; here we verify
+ * that this did indeed happen.  This must be called after reached consistent
+ * state.
  *
- * This can't be checked in allow_in_place_tablespaces mode, so skip it in
- * that case.
+ * ignore_invalid_pages=on turns into the PANIC error into WARNING so that
+ * recovery can continue.  XXX piggybacking on this particular GUC sounds like
+ * a bad idea.  Why not just advise to use allow_in_place_tablespaces?
  */
 static void
 CheckTablespaceDirectory(void)
 {
-	char *tblspc_path = "./pg_tblspc";
 	DIR		   *dir;
 	struct dirent *de;
 
-	/* Do not check for this when test tablespaces are in use */
+	/*
+	 * In allow_in_place_tablespaces mode, it is valid to have non-symlink
+	 * directories in pg_tblspc, so we cannot run this check.  Give up.
+	 */
 	if (allow_in_place_tablespaces)
 		return;
 
-	dir = AllocateDir(tblspc_path);
-	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
 	{
-		char	path[MAXPGPATH];
-		char   *p;
-#ifndef WIN32
-		struct stat st;
-#endif
+		char	path[MAXPGPATH + 10];
 
 		/* Skip entries of non-oid names */
-		for (p = de->d_name; *p && isdigit(*p); p++);
-		if (*p)
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
 			continue;
 
-		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
 
-#ifndef WIN32
-		if (lstat(path, &st) < 0)
-			ereport(ERROR, errcode_for_file_access(),
-					errmsg("could not stat file \"%s\": %m", path));
-
-		if (!S_ISLNK(st.st_mode))
-#else
-		if (!pgwin32_is_junction(path))
-#endif
-			elog(ignore_invalid_pages ? WARNING : PANIC,
-				 "real directory found in pg_tblspc directory: %s", de->d_name);
+		if (!is_path_tslink(path))
+			ereport(ignore_invalid_pages ? WARNING : PANIC,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("real directory found in pg_tblspc directory: %s", de->d_name)));
 	}
 }
 
-- 
2.30.2

v25-0003-Don-t-rely-on-ignore_invalid_pages-at-all.patch (text/x-diff; charset=us-ascii)

From 0e03c82ee07569b869b382fac83d98d0b5d5d870 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 14 Jul 2022 23:39:08 +0200
Subject: [PATCH v25 3/4] Don't rely on ignore_invalid_pages at all

Since a workaround with allow_in_place_tablespaces is possible,
there doesn't seem to be a need for the ignore_invalid_pages one.
Remove it.
---
 src/backend/access/transam/xlogrecovery.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index e04d30cf3e..b0ae63fbac 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2035,10 +2035,6 @@ is_path_tslink(const char *path)
  * is reached these directories should have been removed; here we verify
  * that this did indeed happen.  This must be called after reached consistent
  * state.
- *
- * ignore_invalid_pages=on turns into the PANIC error into WARNING so that
- * recovery can continue.  XXX piggybacking on this particular GUC sounds like
- * a bad idea.  Why not just advise to use allow_in_place_tablespaces?
  */
 static void
 CheckTablespaceDirectory(void)
@@ -2065,7 +2061,7 @@ CheckTablespaceDirectory(void)
 		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
 
 		if (!is_path_tslink(path))
-			ereport(ignore_invalid_pages ? WARNING : PANIC,
+			ereport(PANIC,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("real directory found in pg_tblspc directory: %s", de->d_name)));
 	}
-- 
2.30.2

v25-0004-do-CheckTablespaceDirectory-first.patch (text/x-diff; charset=us-ascii)

From 427186297c41e38c61ebb73d07e962400c7ccc22 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 14 Jul 2022 16:43:18 +0200
Subject: [PATCH v25 4/4] do CheckTablespaceDirectory first

---
 src/backend/access/transam/xlogrecovery.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b0ae63fbac..c457c36daa 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2127,18 +2127,18 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check that pg_tblspc doesn't contain any real directories.
+		 * Replay of Database/CREATE_* records may have created fictitious
+		 * tablespace directories that should have been removed by the time
+		 * consistency was reached.
+		 */
+		CheckTablespaceDirectory();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
 						LSN_FORMAT_ARGS(lastReplayedEndRecPtr))));
-
-		/*
-		 * Check that pg_tblspc doesn't contain a real
-		 * directory. Database/CREATE_* records may create a tablespace
-		 * directory that should have been removed until consistency is
-		 * reached.
-		 */
-		CheckTablespaceDirectory();
 	}
 
 	/*
-- 
2.30.2

#101

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#100)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 14 Jul 2022 23:47:40 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Here's a couple of fixups. 0001 is the same as before. In 0002 I think

Thanks!

+ 		if (!S_ISLNK(st.st_mode))
+ #else
+ 		if (!pgwin32_is_junction(path))
+ #endif
+ 			elog(ignore_invalid_pages ? WARNING : PANIC,
+ 				 "real directory found in pg_tblspc directory: %s", de->d_name);

A regular file with an oid-name also causes this error. Doesn't
something like "unexpected non-(sym)link entry..." work?

CheckTablespaceDirectory ends up easier to read if we split out the test
for validity of the link. Looking at that again, I think we don't need
to piggyback on ignore_invalid_pages, which is already a stretch, so
let's not -- instead we can use allow_in_place_tablespaces if users need
a workaround. So that's 0003 (this bit needs more than zero docs,
however.)

The result of 0003 looks good.

0002:
+is_path_tslink(const char *path)

What does the "ts" in tslink stand for? If it stands for tablespace, the
function is not specific to tablespaces. We already have

+ errmsg("could not stat file \"%s\": %m", path));

I'm not sure we need such correctness, but what is failing there is
lstat. I found similar code in two places in the backend and one place
in the frontend. So couldn't it be moved to /common and have a more
generic name?
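
For illustration only, a generic helper of the kind suggested here might
look like the sketch below. The name is_symlink_posix is hypothetical and
does not come from the patch. It covers POSIX only (no Windows junction
points), reports failures on stderr instead of through ereport, and names
lstat in the message since that is the call that actually fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Returns 1 if "path" is a symbolic link, 0 if it is anything else, and
 * -1 if lstat() fails (a message naming lstat is written to stderr).
 */
static int
is_symlink_posix(const char *path)
{
	struct stat st;

	assert(path != NULL);

	/* lstat() examines the link itself; stat() would follow it. */
	if (lstat(path, &st) < 0)
	{
		fprintf(stderr, "could not lstat file \"%s\": %s\n",
				path, strerror(errno));
		return -1;
	}

	return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

A caller that must treat junction points as links on Windows would still
need a platform-specific branch, which is the part under discussion.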

-	dir = AllocateDir(tblspc_path);
-	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)

xlog.c uses the macro XLOGDIR. Why don't we define TBLSPCDIR?

-		for (p = de->d_name; *p && isdigit(*p); p++);
-		if (*p)
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
 			continue;

The pattern "strspn != strlen" looks kind of remote, or somewhat
pedantic.
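
To make the comparison concrete, here is a small standalone sketch of the
two forms side by side. The function names are made up for this example;
only the two tests themselves are taken from the hunk above. Both accept
exactly the names that consist solely of decimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Loop form: skip leading digits, then check that we hit the terminator. */
static bool
all_digits_loop(const char *name)
{
	const char *p = name;

	assert(name != NULL);

	/* The cast avoids undefined behavior for negative char values. */
	while (*p && isdigit((unsigned char) *p))
		p++;

	return *p == '\0';
}

/* strspn form: the leading run of digits must span the whole string. */
static bool
all_digits_strspn(const char *name)
{
	assert(name != NULL);

	return strspn(name, "0123456789") == strlen(name);
}
```

Note that both forms also accept the empty string, which should not
matter for directory entries.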

+		char	path[MAXPGPATH + 10];
..
-		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);

I don't think we need the extra 10 bytes. A bit paranoid, but we can
check the return value to confirm that the d_name is fully stored in the
buffer.
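
As a rough sketch of that check, assuming standard C snprintf semantics
(the return value is the length the full result would have had), something
like the following would detect truncation. The helper name
build_tblspc_path is hypothetical and is not in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "pg_tblspc/<name>" into buf.
 *
 * Returns true if the whole path fit, false if snprintf reported an
 * error or the output was truncated.
 */
static bool
build_tblspc_path(char *buf, size_t buflen, const char *name)
{
	int			n;

	assert(buf != NULL && name != NULL && buflen > 0);

	n = snprintf(buf, buflen, "pg_tblspc/%s", name);

	/* n counts the characters needed, excluding the terminating NUL. */
	return n >= 0 && (size_t) n < buflen;
}
```

With a check like this the buffer could stay at MAXPGPATH, and an
over-long entry would be reported instead of silently shortened.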

0004 is straightforward: let's check for bad directories before logging
about consistent state.

I was about to write a comment to do this when looking at 0001.

After all this, I'm not sure what to think of dbase_redo. At line 3102,
is the directory supposed to exist or not? I'm confused as to what is
the expected state at that point. I rewrote this, but now I think my
rewrite continues to be confusing, so I'll have to think more about it.

I'm not sure what line 3102 exactly points to, but haven't we chosen to create
everything required to keep recovery going, whether it is supposed to
exist or not?

Another aspect are the tests. Robert described a scenario where the
previously committed version of this patch created trouble. Do we have
a test case to cover that problematic case? I think we should strive to
cover it, if possible.

I couldn't recall that clearly and failed to dig it out of the thread,
but doesn't the "creating everything needed" strategy naturally handle
that case? We could add that test, but it seems to me a little
cumbersome to confirm that the test correctly detects that case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#102

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Kyotaro Horiguchi (#101)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-15, Kyotaro Horiguchi wrote:

At Thu, 14 Jul 2022 23:47:40 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Here's a couple of fixups. 0001 is the same as before. In 0002 I think

Thanks!
+ 		if (!S_ISLNK(st.st_mode))
+ #else
+ 		if (!pgwin32_is_junction(path))
+ #endif
+ 			elog(ignore_invalid_pages ? WARNING : PANIC,
+ 				 "real directory found in pg_tblspc directory: %s", de->d_name);
A regular file with an oid-name also causes this error. Doesn't
something like "unexpected non-(sym)link entry..." work?

Hmm, good point. I also wonder if we need to cater for using the term
"junction point" rather than "symlink" when under Windows.

CheckTablespaceDirectory ends up easier to read if we split out the test
for validity of the link. Looking at that again, I think we don't need
to piggyback on ignore_invalid_pages, which is already a stretch, so
let's not -- instead we can use allow_in_place_tablespaces if users need
a workaround. So that's 0003 (this bit needs more than zero docs,
however.)

The result of 0003 looks good.

Great, will merge.

0002:
+is_path_tslink(const char *path)

What does the "ts" in tslink stand for? If it stands for tablespace, the
function is not specific to tablespaces.

Oh, of course.

We already have

+ errmsg("could not stat file \"%s\": %m", path));

I'm not sure we need such correctness, but what is failing there is
lstat.

I'll have a look at what we use for lstat failures in other places.

I found similar code in two places in the backend and one place
in the frontend. So couldn't it be moved to /common and have a more
generic name?

I'll have a look at those. I had the same instinct initially ...

-	dir = AllocateDir(tblspc_path);
-	while ((de = ReadDir(dir, tblspc_path)) != NULL)
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)

xlog.c uses the macro XLOGDIR. Why don't we define TBLSPCDIR?

Oh yes, let's do that. I'd even backpatch that, to avoid a future
backpatching gotcha.

-		for (p = de->d_name; *p && isdigit(*p); p++);
-		if (*p)
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
continue;

The pattern "strspn != strlen" looks kind of remote, or somewhat
pedantic..

+		char	path[MAXPGPATH + 10];
..
-		snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);

I don't think we need the extra 10 bytes.

I forgot to mention this, but I just copied these bits from some other
place that processes pg_tblspc entries. It seemed to me that the
bodiless for loop was a bit too suspicious-looking.

A bit paranoid, but we can check the return value to confirm that the
d_name is fully stored in the buffer.

Hmm ... I don't think we need to care about that in this patch. This
coding pattern is already being used in other places. If we want to
change that, let's do it everywhere, and not in an unrelated
backpatchable bug fix.

After all this, I'm not sure what to think of dbase_redo. At line 3102,
is the directory supposed to exist or not? I'm confused as to what is
the expected state at that point. I rewrote this, but now I think my
rewrite continues to be confusing, so I'll have to think more about it.

I'm not sure what line 3102 exactly points to, but haven't we chosen to create
everything required to keep recovery going, whether it is supposed to
exist or not?

I mean just after the two stat() calls for the target directory.

Another aspect are the tests. Robert described a scenario where the
previously committed version of this patch created trouble. Do we have
a test case to cover that problematic case? I think we should strive to
cover it, if possible.

I couldn't recall that clearly and failed to dig it out of the thread,
but doesn't the "creating everything needed" strategy naturally handle
that case? We could add that test, but it seems to me a little
cumbersome to confirm that the test correctly detects that case.

Well, I *hope* it does ... but hope is no strategy, and I've frequently
been on the wrong side when trusting that untested code does what I
think it does.

Thanks for reviewing,

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there. Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html

#103

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Kyotaro Horiguchi (#101)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-15, Kyotaro Horiguchi wrote:

0002:
+is_path_tslink(const char *path)

What does the "ts" in tslink stand for? If it stands for tablespace, the
function is not specific to tablespaces. We already have

+ errmsg("could not stat file \"%s\": %m", path));

I'm not sure we need such correctness, but what is failing there is
lstat. I found similar code in two places in the backend and one place
in the frontend. So couldn't it be moved to /common and have a more
generic name?

I wondered whether it'd be better to check whether get_dirent_type
returns PGFILETYPE_LNK. However, that doesn't deal with junction points
at all, which seems pretty odd ... I mean, isn't it rather useful as an
abstraction if it doesn't abstract away the one platform-dependent point
we have in the area?

However, looking closer I noticed that on Windows we use our own
readdir() implementation, which AFAICT includes everything to handle
reparse points as symlinks correctly in get_dirent_type. Which means
that do_pg_backup_start is wasting its time with the "#ifdef WIN32" bits
to handle junction points separately. We could just do this

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b809a2152c..4966213fde 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8302,13 +8302,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			 * we sometimes use allow_in_place_tablespaces to create
 			 * directories directly under pg_tblspc, which would fail below.
 			 */
-#ifdef WIN32
-			if (!pgwin32_is_junction(fullpath))
-				continue;
-#else
 			if (get_dirent_type(fullpath, de, false, ERROR) != PGFILETYPE_LNK)
 				continue;
-#endif

 #if defined(HAVE_READLINK) || defined(WIN32)
 			rllen = readlink(fullpath, linkpath, sizeof(linkpath));

And everything should continue to work.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#104

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#103)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-15, Alvaro Herrera wrote:

However, looking closer I noticed that on Windows we use our own
readdir() implementation, which AFAICT includes everything to handle
reparse points as symlinks correctly in get_dirent_type. Which means
that do_pg_backup_start is wasting its time with the "#ifdef WIN32" bits
to handle junction points separately. We could just do this
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b809a2152c..4966213fde 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8302,13 +8302,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			 * we sometimes use allow_in_place_tablespaces to create
 			 * directories directly under pg_tblspc, which would fail below.
 			 */
-#ifdef WIN32
-			if (!pgwin32_is_junction(fullpath))
-				continue;
-#else
 			if (get_dirent_type(fullpath, de, false, ERROR) != PGFILETYPE_LNK)
 				continue;
-#endif
 #if defined(HAVE_READLINK) || defined(WIN32)
 			rllen = readlink(fullpath, linkpath, sizeof(linkpath));

And everything should continue to work.

Hmm, but it does not:
https://cirrus-ci.com/build/4824963784900608

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#105

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#100)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

v26 here. I spent some time fighting the readdir() stuff for
Windows (so that get_dirent_type returns LNK for junction points)
but couldn't make it work and was unable to figure out why.
So I ended up doing what do_pg_backup_start is already doing:
an #ifdef to call pgwin32_is_junction instead. I removed the
newly added path_is_symlink function, because I realized that
it would mean an extra syscall everywhere other than Windows.

So if somebody wants to fix get_dirent_type() so that it works properly
on Windows, we can change all these places together.

I also changed the use of ignore_invalid_pages to
allow_in_place_tablespaces. We could add a
separate GUC for this, but that seems like overengineering.
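
As a reading aid, the decision made when a directory is found missing
during replay can be sketched as below. This is modeled on
maybe_create_directory as posted in v25-0001, on the assumption that v26
keeps the same shape with a different GUC. The type and function names
are invented for the example, and override_enabled stands for whichever
setting is consulted (ignore_invalid_pages in v25,
allow_in_place_tablespaces in v26).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the patch itself uses elog/ereport levels directly. */
typedef enum
{
	CREATE_QUIETLY,				/* create the directory, log at DEBUG1 */
	CREATE_WITH_WARNING,		/* create the directory, log at WARNING */
	RAISE_PANIC					/* refuse and stop recovery */
} MissingDirAction;

static MissingDirAction
missing_dir_action(bool reached_consistency, bool override_enabled)
{
	/* Before consistency a missing directory is expected and harmless. */
	if (!reached_consistency)
		return CREATE_QUIETLY;

	/* After consistency it is an error unless the escape hatch is on. */
	return override_enabled ? CREATE_WITH_WARNING : RAISE_PANIC;
}

static const char *
missing_dir_action_name(MissingDirAction action)
{
	switch (action)
	{
		case CREATE_QUIETLY:
			return "create (DEBUG1)";
		case CREATE_WITH_WARNING:
			return "create (WARNING)";
		case RAISE_PANIC:
			return "PANIC";
	}

	assert(!"unknown MissingDirAction");
	return "?";
}
```

The sketch covers only directory creation during replay, not the separate
CheckTablespaceDirectory pass that runs when consistency is reached.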

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
sources, so let's move on." (Nathaniel Smith)

Attachments:

v26-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-diff; charset=us-ascii)

From 26a0be53592a20aa09501e9015f77a4b3c3c3302 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 13 Jul 2022 18:14:18 +0200
Subject: [PATCH v26] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch allows missing tablespaces to be created during recovery
before reaching consistency.  The tablespaces are created as real
directories, which should not exist but are expected to be removed by
the time consistency is reached. CheckRecoveryConsistency is
responsible for making sure they have disappeared.

The problems detected by this new code are reported as PANIC, except
when allow_in_place_tablespaces is set to ON, in which case they are
WARNING.  Apart from making tests possible, this gives users an escape
hatch in case things don't go as planned.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |  54 +++++++
 src/backend/commands/dbcommands.c           |  77 ++++++++++
 src/backend/commands/tablespace.c           |  28 +---
 src/test/recovery/t/033_replay_tsp_drops.pl | 155 ++++++++++++++++++++
 4 files changed, 287 insertions(+), 27 deletions(-)
 create mode 100644 src/test/recovery/t/033_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..850ab6d7e6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
 #include "access/xlogutils.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
@@ -2008,6 +2009,51 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Verify that, in non-test mode, ./pg_tblspc doesn't contain any real
+ * directories.
+ *
+ * Replay of database creation XLOG records for databases that were later
+ * dropped can create fake directories in pg_tblspc.  By the time consistency
+ * is reached these directories should have been removed; here we verify
+ * that this did indeed happen.  This is to be called at the point where
+ * consistent state is reached.
+ *
+ * allow_in_place_tablespaces turns the PANIC into a WARNING, which is
+ * useful for testing purposes, and also allows for an escape hatch in case
+ * things go south.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
+	{
+		char		path[MAXPGPATH + 10];
+
+		/* Skip entries of non-oid names */
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+			continue;
+
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
+
+#ifdef WIN32
+		if (!pgwin32_is_junction(path))
+#else
+		if (get_dirent_type(path, de, false, ERROR) != PGFILETYPE_LNK)
+#endif
+			ereport(allow_in_place_tablespaces ? WARNING : PANIC,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("unexpected directory entry \"%s\" found in %s",
+							de->d_name, "pg_tblspc/"),
+					 errdetail("All directory entries in pg_tblspc/ should be symbolic links."),
+					 errhint("Remove those directories, or set allow_in_place_tablespaces to ON transiently to let recovery complete.")));
+	}
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2068,6 +2114,14 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check that pg_tblspc doesn't contain any real directories. Replay
+		 * of Database/CREATE_* records may have created fictitious tablespace
+		 * directories that should have been removed by the time consistency
+		 * was reached.
+		 */
+		CheckTablespaceDirectory();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 099d369b2f..95844bbb69 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -47,6 +48,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -62,6 +64,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/pg_locale.h"
 #include "utils/relmapper.h"
 #include "utils/snapmgr.h"
@@ -135,6 +138,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
 										Oid dst_tsid);
+static void recovery_create_dbdir(char *path, bool only_tblspc);
 
 /*
  * Create a new database using the WAL_LOG strategy.
@@ -2995,6 +2999,45 @@ get_database_name(Oid dbid)
 	return result;
 }
 
+/*
+ * recovery_create_dbdir()
+ *
+ * During recovery, there's a case where we validly need to recover a missing
+ * tablespace directory so that recovery can continue.  This happens when
+ * recovery wants to create a database but the holding tablespace has been
+ * removed before the server stopped.  Since we expect that the directory will
+ * be gone before reaching recovery consistency, and we have no knowledge about
+ * the tablespace other than its OID here, we create a real directory under
+ * pg_tblspc here instead of restoring the symlink.
+ *
+ * If only_tblspc is true, then the requested directory must be in pg_tblspc/
+ */
+static void
+recovery_create_dbdir(char *path, bool only_tblspc)
+{
+	struct stat st;
+
+	Assert(RecoveryInProgress());
+
+	if (stat(path, &st) == 0)
+		return;
+
+	if (only_tblspc && strstr(path, "pg_tblspc/") == NULL)
+		elog(PANIC, "requested to create invalid directory: %s", path);
+
+	if (reachedConsistency && !allow_in_place_tablespaces)
+		ereport(PANIC,
+				errmsg("missing directory \"%s\"", path));
+
+	elog(reachedConsistency ? WARNING : DEBUG1,
+		 "creating missing directory: %s", path);
+
+	if (pg_mkdir_p(path, pg_dir_create_mode) != 0)
+		ereport(PANIC,
+				errmsg("could not create missing directory \"%s\": %m", path));
+}
+
+
 /*
  * DATABASE resource manager's routines
  */
@@ -3012,6 +3055,7 @@ dbase_redo(XLogReaderState *record)
 		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
@@ -3031,6 +3075,33 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		/*
+		 * If the parent of the target path doesn't exist, create it now. This
+		 * enables us to create the target underneath later.
+		 */
+		parent_path = pstrdup(dst_path);
+		get_parent_directory(parent_path);
+		if (stat(parent_path, &st) < 0)
+		{
+			if (errno != ENOENT)
+				ereport(FATAL,
+						errmsg("could not stat directory \"%s\": %m",
+							   dst_path));
+
+			/* create the parent directory if needed and valid */
+			recovery_create_dbdir(parent_path, true);
+		}
+		pfree(parent_path);
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+			recovery_create_dbdir(src_path, false);
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
@@ -3055,9 +3126,15 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
 		char	   *dbpath;
+		char	   *parent_path;
 
 		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		/* create the parent directory if needed and valid */
+		parent_path = pstrdup(dbpath);
+		get_parent_directory(parent_path);
+		recovery_create_dbdir(parent_path, true);
+
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index cb7d46089a..a23097399e 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -156,8 +156,6 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -170,32 +168,8 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 					 * continue by creating simple parent directories rather
 					 * than a symlink.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
 					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
diff --git a/src/test/recovery/t/033_replay_tsp_drops.pl b/src/test/recovery/t/033_replay_tsp_drops.pl
new file mode 100644
index 0000000000..0986df45e6
--- /dev/null
+++ b/src/test/recovery/t/033_replay_tsp_drops.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test replay of tablespace/database creation/drop
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_tablespace
+{
+	my ($strategy) = @_;
+
+	my $node_primary = PostgreSQL::Test::Cluster->new("primary1_$strategy");
+	$node_primary->init(allows_streaming => 1);
+	$node_primary->start;
+	$node_primary->psql('postgres',
+			qq[
+				SET allow_in_place_tablespaces=on;
+				CREATE TABLESPACE dropme_ts1 LOCATION '';
+				CREATE TABLESPACE dropme_ts2 LOCATION '';
+				CREATE TABLESPACE source_ts  LOCATION '';
+				CREATE TABLESPACE target_ts  LOCATION '';
+				CREATE DATABASE template_db IS_TEMPLATE = true;
+			]);
+	my $backup_name = 'my_backup';
+	$node_primary->backup($backup_name);
+
+	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
+	$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+	$node_standby->append_conf('postgresql.conf', "allow_in_place_tablespaces = on");
+	$node_standby->start;
+
+	# Make sure connection is made
+	$node_primary->poll_query_until(
+		'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+	$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+	# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+	# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+	# to be applied to already-removed directories.
+	my $query = q[
+	CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1 STRATEGY=<STRATEGY>;
+	CREATE TABLE t (a int) TABLESPACE dropme_ts2;
+	CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2 STRATEGY=<STRATEGY>;
+	CREATE DATABASE moveme_db TABLESPACE source_ts STRATEGY=<STRATEGY>;
+	ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+	CREATE DATABASE newdb TEMPLATE template_db STRATEGY=<STRATEGY>;
+	ALTER DATABASE template_db IS_TEMPLATE = false;
+	DROP DATABASE dropme_db1;
+	DROP TABLE t;
+	DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+	DROP TABLESPACE source_ts;
+	DROP DATABASE template_db;];
+
+	$query =~ s/<STRATEGY>/$strategy/g;
+	$node_primary->safe_psql('postgres', $query);
+	$node_primary->wait_for_catchup($node_standby, 'replay',
+									$node_primary->lsn('replay'));
+
+	# show "create missing directory" log message
+	$node_standby->safe_psql('postgres',
+							 "ALTER SYSTEM SET log_min_messages TO debug1;");
+	$node_standby->stop('immediate');
+	# Should restart ignoring directory creation error.
+	is($node_standby->start(fail_ok => 1), 1, "standby node started for $strategy");
+	$node_standby->stop('immediate');
+}
+
+test_tablespace("FILE_COPY");
+test_tablespace("WAL_LOG");
+
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).  This is
+# effective only for CREATE DATABASE WITH STRATEGY=FILE_COPY.
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+$node_primary->safe_psql('postgres', q[
+						 SET allow_in_place_tablespaces=on;
+						 CREATE TABLESPACE ts1 LOCATION '']);
+$node_primary->safe_psql('postgres', "CREATE DATABASE db1 WITH TABLESPACE ts1 STRATEGY=FILE_COPY");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "allow_in_place_tablespaces = on");
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+my $tspoid = $node_standby->safe_psql('postgres',
+									  "SELECT oid FROM pg_tablespace WHERE spcname = 'ts1';");
+my $tspdir = $node_standby->data_dir . "/pg_tblspc/$tspoid";
+File::Path::rmtree($tspdir);
+
+my $logstart = get_log_size($node_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql('postgres',
+						q[CREATE TABLE should_not_replay_insertion(a int);
+						  CREATE DATABASE db2 WITH TABLESPACE ts1 STRATEGY=FILE_COPY;
+						  INSERT INTO should_not_replay_insertion VALUES (1);]);
+
+# Standby should fail and should not silently skip replaying the wal
+# In this test, PANIC turns into WARNING by allow_in_place_tablespaces.
+# Check the log messages instead of confirming standby failure.
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log(
+				 $node_standby,
+				 "WARNING:  creating missing directory: pg_tblspc/",
+				 $logstart));
+	sleep 1;
+}
+ok($max_attempts > 0, "invalid directory creation is detected");
+
+done_testing();
+
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.30.2

#106

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#105)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-20, Alvaro Herrera wrote:

I also changed the use of allow_invalid_pages to
allow_in_place_tablespaces. We could add a
separate GUC for this, but it seems like overengineering.

Oh, but allow_in_place_tablespaces doesn't exist in versions 14 and
older, so this strategy doesn't really work.

I see the following alternatives:

1. not backpatch this fix to 14 and older
2. use a different GUC; either allow_invalid_pages as previously
suggested, or create a new one just for this purpose
3. not provide any overriding mechanism in versions 14 and older

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine." (Julien PUYDT)

#107

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#106)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-20, Alvaro Herrera wrote:

On 2022-Jul-20, Alvaro Herrera wrote:

I also changed the use of allow_invalid_pages to
allow_in_place_tablespaces. We could add a
separate GUC for this, but it seems like overengineering.

Oh, but allow_in_place_tablespaces doesn't exist in versions 14 and
older, so this strategy doesn't really work.

... and get_dirent_type is new in 14, so that'll be one more hurdle.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Cuando no hay humildad las personas se degradan" (A. Christie)

#108

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#106)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-20, Alvaro Herrera wrote:

I see the following alternatives:

1. not backpatch this fix to 14 and older
2. use a different GUC; either allow_invalid_pages as previously
suggested, or create a new one just for this purpose
3. not provide any overriding mechanism in versions 14 and older

I've got no opinions on this. I don't like either 1 or 3, so I'm going
to add and backpatch a new GUC allow_recovery_tablespaces as the
override mechanism.

If others disagree with this choice, please speak up.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#109

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#105)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Jul 20, 2022 at 10:51 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

v26 here. I spent some time fighting the readdir() stuff for
Windows (so that get_dirent_type returns LNK for junction points)
but couldn't make it work and was unable to figure out why.

Was it because of this?

/messages/by-id/CA+hUKGKv+736Pc8kSj3=DijDGd1eC79-uT3Vi16n7jYkcc_raw@mail.gmail.com

#110

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#108)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Thu, Jul 21, 2022 at 11:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Jul-20, Alvaro Herrera wrote:

I see the following alternatives:

1. not backpatch this fix to 14 and older
2. use a different GUC; either allow_invalid_pages as previously
suggested, or create a new one just for this purpose
3. not provide any overriding mechanism in versions 14 and older

I've got no opinions on this. I don't like either 1 or 3, so I'm going
to add and backpatch a new GUC allow_recovery_tablespaces as the
override mechanism.

If others disagree with this choice, please speak up.

Would it help if we back-patched the allow_in_place_tablespaces stuff?
I'm not sure how hard/destabilising that would be, but I could take a
look tomorrow.

#111

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Thomas Munro (#109)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-21, Thomas Munro wrote:

On Wed, Jul 20, 2022 at 10:51 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

v26 here. I spent some time fighting the readdir() stuff for
Windows (so that get_dirent_type returns LNK for junction points)
but couldn't make it work and was unable to figure out why.

Was it because of this?

/messages/by-id/CA+hUKGKv+736Pc8kSj3=DijDGd1eC79-uT3Vi16n7jYkcc_raw@mail.gmail.com

Oh, that sounds very likely, yeah. I didn't think of testing the
FILE_ATTRIBUTE_DIRECTORY bit for junction points.

I +1 pushing both of these patches to 14. Then this patch becomes a
couple of lines shorter.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are." -- Charles J. Sykes' advice to teenagers

#112

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Thomas Munro (#110)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-21, Thomas Munro wrote:

On Thu, Jul 21, 2022 at 11:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I've got no opinions on this. I don't like either 1 or 3, so I'm going
to add and backpatch a new GUC allow_recovery_tablespaces as the
override mechanism.

If others disagree with this choice, please speak up.

Would it help if we back-patched the allow_in_place_tablespaces stuff?
I'm not sure how hard/destabilising that would be, but I could take a
look tomorrow.

Yeah, I think that would reduce cruft. I'm not sure this is more
against backpatching policy or less, compared to adding a separate
GUC just for this bugfix.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
(Paul Graham)

#113

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#112)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-21, Alvaro Herrera wrote:

Yeah, I think that would reduce cruft. I'm not sure this is more
against backpatching policy or less, compared to adding a separate
GUC just for this bugfix.

cruft:

{
{"allow_recovery_tablespaces", PG_POSTMASTER, WAL_RECOVERY,
gettext_noop("Continues recovery after finding invalid database directories."),
gettext_noop("It is possible for tablespace drop to interfere with database creation "
"so that WAL replay is forced to create fake database directories. "
"These should have been dropped by the time recovery ends; "
"but in case they aren't, this option lets recovery continue if they "
"are present. Note that these directories must be removed manually afterwards."),
GUC_NOT_IN_SAMPLE
},
&allow_recovery_tablespaces,
false,
NULL, NULL, NULL
},

This is not a very good explanation, but I don't know how to make it
better.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)

#114

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Thomas Munro (#110)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 21 Jul 2022 23:14:57 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Thu, Jul 21, 2022 at 11:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Jul-20, Alvaro Herrera wrote:

I see the following alternatives:

1. not backpatch this fix to 14 and older
2. use a different GUC; either allow_invalid_pages as previously
suggested, or create a new one just for this purpose
3. not provide any overriding mechanism in versions 14 and older

I've got no opinions on this. I don't like either 1 or 3, so I'm going
to add and backpatch a new GUC allow_recovery_tablespaces as the
override mechanism.

If others disagree with this choice, please speak up.

Would it help if we back-patched the allow_in_place_tablespaces stuff?
I'm not sure how hard/destabilising that would be, but I could take a
look tomorrow.

+1. An additional reason for me is that it is a developer option.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#115

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#113)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Thu, 21 Jul 2022 13:25:05 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2022-Jul-21, Alvaro Herrera wrote:

Yeah, I think that would reduce cruft. I'm not sure this is more
against backpatching policy or less, compared to adding a separate
GUC just for this bugfix.

cruft:

{
{"allow_recovery_tablespaces", PG_POSTMASTER, WAL_RECOVERY,
gettext_noop("Continues recovery after finding invalid database directories."),
gettext_noop("It is possible for tablespace drop to interfere with database creation "
"so that WAL replay is forced to create fake database directories. "
"These should have been dropped by the time recovery ends; "
"but in case they aren't, this option lets recovery continue if they "
"are present. Note that these directories must be removed manually afterwards."),
GUC_NOT_IN_SAMPLE
},
&allow_recovery_tablespaces,
false,
NULL, NULL, NULL
},

This is not a very good explanation, but I don't know how to make it
better.

It looks a bit too detailed. I crafted the following..

Recovery can create tentative in-place tablespace directories under
pg_tblspc/. They are expected to be removed by the time recovery
reaches consistency; otherwise PostgreSQL raises a PANIC-level error,
aborting the recovery. Setting allow_recovery_tablespaces to true
causes the system to allow such directories during normal
operation. If those directories remain after reaching
consistency, that implies data loss and metadata inconsistency and may
cause failure of future tablespace creation.

Though, after writing this, I came to think that piggy-backing on
allow_in_place_tablespaces might be a bit different.
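
The policy for a directory that is missing during replay, as implemented by recovery_create_dbdir() in the v26 patch, can be summarized as a small decision function. This is only a sketch: the enum and function names are hypothetical, the action values merely mirror the log levels used in the patch, and "override_on" stands for whichever GUC ends up being the override (allow_in_place_tablespaces in v26). The patch's additional check that the path lies under pg_tblspc/ is left out.

```c
#include <stdbool.h>

/* Hypothetical names; they mirror, but are not, PostgreSQL's elog levels. */
typedef enum
{
	ACT_NOTHING,				/* directory exists, nothing to do */
	ACT_CREATE_QUIETLY,			/* create it, log at DEBUG1 */
	ACT_CREATE_WITH_WARNING,	/* create it, log a WARNING */
	ACT_PANIC					/* refuse and abort recovery */
} MissingDirAction;

/* Decide what replay does when a needed directory may be missing. */
static MissingDirAction
missing_dir_action(bool dir_exists, bool reached_consistency, bool override_on)
{
	if (dir_exists)
		return ACT_NOTHING;

	/* Before consistency a missing directory can be legitimate. */
	if (!reached_consistency)
		return ACT_CREATE_QUIETLY;

	/* After consistency it is an error unless the override is set. */
	return override_on ? ACT_CREATE_WITH_WARNING : ACT_PANIC;
}
```

The same override also downgrades the PANIC raised at the consistency point for leftover real directories in pg_tblspc/ to a WARNING, which is what makes the TAP test in the patch possible.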

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#116

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Kyotaro Horiguchi (#114)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-22, Kyotaro Horiguchi wrote:

At Thu, 21 Jul 2022 23:14:57 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

Would it help if we back-patched the allow_in_place_tablespaces stuff?
I'm not sure how hard/destabilising that would be, but I could take a
look tomorrow.

+1. An additional reason for me is that it is a developer option.

OK, I'll wait for allow_in_place_tablespaces to be backpatched then.

I would like to get this fix pushed before the next set of minors, so if
you won't have time for the backpatches early enough, maybe I can work
on getting it done.

Which commits would we consider?

7170f2159fb2 Allow "in place" tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
sources, so let's move on." (Nathaniel Smith)

#117

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#116)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Fri, 22 Jul 2022 10:18:58 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

OK, I'll wait for allow_in_place_tablespaces to be backpatched then.

I would like to get this fix pushed before the next set of minors, so if
you won't have time for the backpatches early enough, maybe I can work
on getting it done.

Which commits would we consider?

7170f2159fb2 Allow "in place" tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces

The second one is just to make the function work with in-place
tablespaces. Without it the function yields the following error.

ERROR: could not read symbolic link "pg_tblspc/16407": Invalid argument

This actually looks odd, but I think there is no need to back-patch it,
because no actual user of the feature is seen in our test suite.
If we have a test that needs the feature in future, it would be enough
to back-patch it then.

So I think only the first one is needed for now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#118

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#116)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Jul 22, 2022 at 8:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Jul-22, Kyotaro Horiguchi wrote:

At Thu, 21 Jul 2022 23:14:57 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

Would it help if we back-patched the allow_in_place_tablespaces stuff?
I'm not sure how hard/destabilising that would be, but I could take a
look tomorrow.

+1. An additional reason for me is that it is a developer option.

OK, I'll wait for allow_in_place_tablespaces to be backpatched then.

I would like to get this fix pushed before the next set of minors, so if
you won't have time for the backpatches early enough, maybe I can work
on getting it done.

Which commits would we consider?

I wonder how crazy it would be to back-patch
src/test/recovery/t/027_stream_regress.pl too.

#119

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Kyotaro Horiguchi (#117)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-22, Kyotaro Horiguchi wrote:

At Fri, 22 Jul 2022 10:18:58 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

Which commits would we consider?

7170f2159fb2 Allow "in place" tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces

The second one is just to make the function work with in-place
tablespaces. Without it the function yields the following error.

ERROR: could not read symbolic link "pg_tblspc/16407": Invalid argument

This actually looks odd, but I think there is no need to back-patch it,
because no actual user of the feature is seen in our test suite.
If we have a test that needs the feature in future, it would be enough
to back-patch it then.

Actually, I found that the new test added by the fix in this thread does
depend on this being fixed, so I included an even larger set, which I
think makes this more complete:

7170f2159fb2 Allow "in place" tablespaces.
c6f2f01611d4 Fix pg_basebackup with in-place tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces
7a7cd84893e0 doc: Remove mention to in-place tablespaces for pg_tablespace_location()
5344723755bd Remove unnecessary Windows-specific basebackup code.

I didn't include any of the test changes for now. I don't intend to do
so, unless we see another reason for that; I think the new tests that
are going to be added by the recovery bugfix should be sufficient
coverage.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)

#120

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Alvaro Herrera (#105)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Okay, I think I'm done with this. Here's v27 for the master branch,
where I fixed some comments as well as thinkos in the test script.
The ones on older branches aren't materially different, they just have
tonnes of conflicts resolved. I'll get this pushed tomorrow morning.

I have run it through CI and it seems ... not completely broken, at
least, but I have no working recipes for Windows on branches 14 and
older, so it doesn't really work fully. If anybody does, please share.
You can see mine here
https://github.com/alvherre/postgres/commits/REL_11_STABLE [etc]
https://cirrus-ci.com/build/5320904228995072
https://cirrus-ci.com/github/alvherre/postgres

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801

Attachments:

v27-0001-Fix-replay-of-create-database-records-on-standby.patch (text/x-diff; charset=utf-8)

From b84f66975d664c45babd878a43c67601b7f7d2b6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 27 Jul 2022 20:22:21 +0200
Subject: [PATCH v27] Fix replay of create database records on standby
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Crash recovery on standby may encounter missing directories
when replaying database-creation WAL records.  Prior to this
patch, the standby would fail to recover in such a case;
however, the directories could be legitimately missing.
Consider the following sequence of commands:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the
tablespace directory, the standby crashes and has to replay the
create database record again, crash recovery must be able to continue.

A fix for this problem was already attempted in 49d9cfc68bf4, but it
was reverted because of design issues.  This new version is based
on Robert Haas' proposal: any missing tablespaces are created
during recovery before reaching consistency.  Tablespaces
are created as real directories, and should be deleted
by later replay.  CheckRecoveryConsistency ensures
they have disappeared.

The problems detected by this new code are reported as PANIC,
except when allow_in_place_tablespaces is set to ON, in which
case they are WARNING.  Apart from making tests possible, this
gives users an escape hatch in case things don't go as planned.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Author: Paul Guo <paulguo@gmail.com>
Reviewed-by: Anastasia Lubennikova <lubennikovaav@gmail.com> (older versions)
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> (older versions)
Reviewed-by: Michaël Paquier <michael@paquier.xyz>
Diagnosed-by: Paul Guo <paulguo@gmail.com>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |  50 ++++++
 src/backend/commands/dbcommands.c           |  77 +++++++++
 src/backend/commands/tablespace.c           |  40 ++---
 src/test/recovery/t/033_replay_tsp_drops.pl | 169 ++++++++++++++++++++
 4 files changed, 305 insertions(+), 31 deletions(-)
 create mode 100644 src/test/recovery/t/033_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index e383c2123a..27e02fbfcd 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
 #include "access/xlogutils.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
@@ -2008,6 +2009,47 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Verify that, in non-test mode, ./pg_tblspc doesn't contain any real
+ * directories.
+ *
+ * Replay of database creation XLOG records for databases that were later
+ * dropped can create fake directories in pg_tblspc.  By the time consistency
+ * is reached these directories should have been removed; here we verify
+ * that this did indeed happen.  This is to be called at the point where
+ * consistent state is reached.
+ *
+ * allow_in_place_tablespaces turns the PANIC into a WARNING, which is
+ * useful for testing purposes, and also allows for an escape hatch in case
+ * things go south.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
+	{
+		char		path[MAXPGPATH + 10];
+
+		/* Skip entries of non-oid names */
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+			continue;
+
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
+
+		if (get_dirent_type(path, de, false, ERROR) != PGFILETYPE_LNK)
+			ereport(allow_in_place_tablespaces ? WARNING : PANIC,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("unexpected directory entry \"%s\" found in %s",
+							de->d_name, "pg_tblspc/"),
+					 errdetail("All directory entries in pg_tblspc/ should be symbolic links."),
+					 errhint("Remove those directories, or set allow_in_place_tablespaces to ON transiently to let recovery complete.")));
+	}
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2068,6 +2110,14 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check that pg_tblspc doesn't contain any real directories. Replay
+		 * of Database/CREATE_* records may have created fictitious tablespace
+		 * directories that should have been removed by the time consistency
+		 * was reached.
+		 */
+		CheckTablespaceDirectory();
+
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 099d369b2f..95844bbb69 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -47,6 +48,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -62,6 +64,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/guc.h"
 #include "utils/pg_locale.h"
 #include "utils/relmapper.h"
 #include "utils/snapmgr.h"
@@ -135,6 +138,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
 										Oid dst_tsid);
+static void recovery_create_dbdir(char *path, bool only_tblspc);
 
 /*
  * Create a new database using the WAL_LOG strategy.
@@ -2995,6 +2999,45 @@ get_database_name(Oid dbid)
 	return result;
 }
 
+/*
+ * recovery_create_dbdir()
+ *
+ * During recovery, there's a case where we validly need to recover a missing
+ * tablespace directory so that recovery can continue.  This happens when
+ * recovery wants to create a database but the holding tablespace has been
+ * removed before the server stopped.  Since we expect that the directory will
+ * be gone before reaching recovery consistency, and we have no knowledge about
+ * the tablespace other than its OID here, we create a real directory under
+ * pg_tblspc here instead of restoring the symlink.
+ *
+ * If only_tblspc is true, then the requested directory must be in pg_tblspc/
+ */
+static void
+recovery_create_dbdir(char *path, bool only_tblspc)
+{
+	struct stat st;
+
+	Assert(RecoveryInProgress());
+
+	if (stat(path, &st) == 0)
+		return;
+
+	if (only_tblspc && strstr(path, "pg_tblspc/") == NULL)
+		elog(PANIC, "requested to create invalid directory: %s", path);
+
+	if (reachedConsistency && !allow_in_place_tablespaces)
+		ereport(PANIC,
+				errmsg("missing directory \"%s\"", path));
+
+	elog(reachedConsistency ? WARNING : DEBUG1,
+		 "creating missing directory: %s", path);
+
+	if (pg_mkdir_p(path, pg_dir_create_mode) != 0)
+		ereport(PANIC,
+				errmsg("could not create missing directory \"%s\": %m", path));
+}
+
+
 /*
  * DATABASE resource manager's routines
  */
@@ -3012,6 +3055,7 @@ dbase_redo(XLogReaderState *record)
 		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
@@ -3031,6 +3075,33 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		/*
+		 * If the parent of the target path doesn't exist, create it now. This
+		 * enables us to create the target underneath later.
+		 */
+		parent_path = pstrdup(dst_path);
+		get_parent_directory(parent_path);
+		if (stat(parent_path, &st) < 0)
+		{
+			if (errno != ENOENT)
+				ereport(FATAL,
+						errmsg("could not stat directory \"%s\": %m",
+							   dst_path));
+
+			/* create the parent directory if needed and valid */
+			recovery_create_dbdir(parent_path, true);
+		}
+		pfree(parent_path);
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason as above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+			recovery_create_dbdir(src_path, false);
+
 		/*
 		 * Force dirty buffers out to disk, to ensure source database is
 		 * up-to-date for the copy.
@@ -3055,9 +3126,15 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
 		char	   *dbpath;
+		char	   *parent_path;
 
 		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		/* create the parent directory if needed and valid */
+		parent_path = pstrdup(dbpath);
+		get_parent_directory(parent_path);
+		recovery_create_dbdir(parent_path, true);
+
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index cb7d46089a..570ce3dbd5 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -156,8 +156,6 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 				/* Directory creation failed? */
 				if (MakePGDirectory(dir) < 0)
 				{
-					char	   *parentdir;
-
 					/* Failure other than not exists or not in WAL replay? */
 					if (errno != ENOENT || !isRedo)
 						ereport(ERROR,
@@ -166,36 +164,16 @@ TablespaceCreateDbspace(Oid spcOid, Oid dbOid, bool isRedo)
 										dir)));
 
 					/*
-					 * Parent directories are missing during WAL replay, so
-					 * continue by creating simple parent directories rather
-					 * than a symlink.
+					 * During WAL replay, it's conceivable that several levels
+					 * of directories are missing if tablespaces are dropped
+					 * further ahead of the WAL stream than we're currently
+					 * replaying.  An easy way forward is to create them as
+					 * plain directories and hope they are removed by further
+					 * WAL replay if necessary.  If this also fails, there is
+					 * trouble we cannot get out of, so just report that and
+					 * bail out.
 					 */
-
-					/* create two parents up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* create one parent up if not exist */
-					parentdir = pstrdup(dir);
-					get_parent_directory(parentdir);
-					/* Can't create parent and it doesn't already exist? */
-					if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not create directory \"%s\": %m",
-										parentdir)));
-					pfree(parentdir);
-
-					/* Create database directory */
-					if (MakePGDirectory(dir) < 0)
+					if (pg_mkdir_p(dir, pg_dir_create_mode) < 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not create directory \"%s\": %m",
diff --git a/src/test/recovery/t/033_replay_tsp_drops.pl b/src/test/recovery/t/033_replay_tsp_drops.pl
new file mode 100644
index 0000000000..9b74cb09ac
--- /dev/null
+++ b/src/test/recovery/t/033_replay_tsp_drops.pl
@@ -0,0 +1,169 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test replay of tablespace/database creation/drop
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_tablespace
+{
+	my ($strategy) = @_;
+
+	my $node_primary = PostgreSQL::Test::Cluster->new("primary1_$strategy");
+	$node_primary->init(allows_streaming => 1);
+	$node_primary->start;
+	$node_primary->psql(
+		'postgres',
+		qq[
+			SET allow_in_place_tablespaces=on;
+			CREATE TABLESPACE dropme_ts1 LOCATION '';
+			CREATE TABLESPACE dropme_ts2 LOCATION '';
+			CREATE TABLESPACE source_ts  LOCATION '';
+			CREATE TABLESPACE target_ts  LOCATION '';
+			CREATE DATABASE template_db IS_TEMPLATE = true;
+		]);
+	my $backup_name = 'my_backup';
+	$node_primary->backup($backup_name);
+
+	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
+	$node_standby->init_from_backup($node_primary, $backup_name,
+		has_streaming => 1);
+	$node_standby->append_conf('postgresql.conf',
+		"allow_in_place_tablespaces = on");
+	$node_standby->start;
+
+	# Make sure connection is made
+	$node_primary->poll_query_until('postgres',
+		'SELECT count(*) = 1 FROM pg_stat_replication');
+
+	$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+	# Do immediate shutdown just after a sequence of CREATE DATABASE / DROP
+	# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+	# to be applied to already-removed directories.
+	my $query = q[
+		CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1 STRATEGY=<STRATEGY>;
+		CREATE TABLE t (a int) TABLESPACE dropme_ts2;
+		CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2 STRATEGY=<STRATEGY>;
+		CREATE DATABASE moveme_db TABLESPACE source_ts STRATEGY=<STRATEGY>;
+		ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+		CREATE DATABASE newdb TEMPLATE template_db STRATEGY=<STRATEGY>;
+		ALTER DATABASE template_db IS_TEMPLATE = false;
+		DROP DATABASE dropme_db1;
+		DROP TABLE t;
+		DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+		DROP TABLESPACE source_ts;
+		DROP DATABASE template_db;
+	];
+
+	$query =~ s/<STRATEGY>/$strategy/g;
+	$node_primary->safe_psql('postgres', $query);
+	$node_primary->wait_for_catchup($node_standby, 'replay',
+		$node_primary->lsn('write'));
+
+	# show "create missing directory" log message
+	$node_standby->safe_psql('postgres',
+		"ALTER SYSTEM SET log_min_messages TO debug1;");
+	$node_standby->stop('immediate');
+	# Should restart ignoring directory creation error.
+	is($node_standby->start(fail_ok => 1),
+		1, "standby node started for $strategy");
+	$node_standby->stop('immediate');
+}
+
+test_tablespace("FILE_COPY");
+test_tablespace("WAL_LOG");
+
+# Ensure that a missing tablespace directory during create database
+# replay immediately causes panic if the standby has already reached
+# consistent state (archive recovery is in progress).  This is
+# effective only for CREATE DATABASE WITH STRATEGY=FILE_COPY.
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary2');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Create tablespace
+$node_primary->safe_psql(
+	'postgres', q[
+		SET allow_in_place_tablespaces=on;
+		CREATE TABLESPACE ts1 LOCATION ''
+			]);
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE db1 WITH TABLESPACE ts1 STRATEGY=FILE_COPY");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby3');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+	"allow_in_place_tablespaces = on");
+$node_standby->start;
+
+# Make sure standby reached consistency and starts accepting connections
+$node_standby->poll_query_until('postgres', 'SELECT 1', '1');
+
+# Remove standby tablespace directory so it will be missing when
+# replay resumes.
+my $tspoid = $node_standby->safe_psql('postgres',
+	"SELECT oid FROM pg_tablespace WHERE spcname = 'ts1';");
+my $tspdir = $node_standby->data_dir . "/pg_tblspc/$tspoid";
+File::Path::rmtree($tspdir);
+
+my $logstart = get_log_size($node_standby);
+
+# Create a database in the tablespace and a table in default tablespace
+$node_primary->safe_psql(
+	'postgres',
+	q[
+		CREATE TABLE should_not_replay_insertion(a int);
+		CREATE DATABASE db2 WITH TABLESPACE ts1 STRATEGY=FILE_COPY;
+		INSERT INTO should_not_replay_insertion VALUES (1);
+	]);
+
+# Standby should fail and should not silently skip replaying the wal
+# In this test, PANIC turns into WARNING by allow_in_place_tablespaces.
+# Check the log messages instead of confirming standby failure.
+my $max_attempts = $PostgreSQL::Test::Utils::timeout_default;
+while ($max_attempts-- >= 0)
+{
+	last
+	  if (
+		find_in_log(
+			$node_standby, "WARNING:  creating missing directory: pg_tblspc/",
+			$logstart));
+	sleep 1;
+}
+ok($max_attempts > 0, "invalid directory creation is detected");
+
+done_testing();
+
+
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.30.2
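
As a reading aid for the patch above, the check done by CheckTablespaceDirectory can be sketched outside the backend. This is a minimal standalone POSIX C sketch and not the backend code. The names looks_like_oid and count_real_tablespace_dirs are hypothetical, plain opendir/lstat stand in for AllocateDir/get_dirent_type, and the function returns a count where the patch raises WARNING or PANIC.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if the name is non-empty and consists only
 * of decimal digits, which is how tablespace OID entries are named.
 */
static int
looks_like_oid(const char *name)
{
    return name[0] != '\0' &&
        strspn(name, "0123456789") == strlen(name);
}

/*
 * Hypothetical helper: count OID-named entries under dirpath that are not
 * symbolic links.  Returns -1 if the directory cannot be opened.
 */
static int
count_real_tablespace_dirs(const char *dirpath)
{
    DIR        *dir = opendir(dirpath);
    struct dirent *de;
    int         count = 0;

    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        char        path[4096];
        struct stat st;

        /* Skip ".", ".." and anything else that is not an OID name */
        if (!looks_like_oid(de->d_name))
            continue;

        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);

        /* lstat, not stat: inspect the entry itself, not a link target */
        if (lstat(path, &st) == 0 && !S_ISLNK(st.st_mode))
            count++;
    }

    closedir(dir);
    return count;
}
```

Calling count_real_tablespace_dirs("pg_tblspc") from the data directory would be expected to return 0 on a cluster that uses only regular (symlinked) tablespaces. In-place tablespaces are real directories, so they would be counted, which fits the patch downgrading its PANIC to WARNING when allow_in_place_tablespaces is on.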

#121

Matthias van de Meent

boekewurm+postgres@gmail.com

over 3 years ago

In reply to: Alvaro Herrera (#120)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, 27 Jul 2022 at 20:55, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Okay, I think I'm done with this. Here's v27 for the master branch,
where I fixed some comments as well as thinkos in the test script.
The ones on older branches aren't materially different, they just have
tonnes of conflicts resolved. I'll get this pushed tomorrow morning.

I have run it through CI and it seems ... not completely broken, at
least, but I have no working recipes for Windows on branches 14 and
older, so it doesn't really work fully. If anybody does, please share.
You can see mine here
https://github.com/alvherre/postgres/commits/REL_11_STABLE [etc]
https://cirrus-ci.com/build/5320904228995072
https://cirrus-ci.com/github/alvherre/postgres

I'd like to bring to your attention that the test that was introduced
with 9e4f914b seems to be flaky in FreeBSD 13 in the CFBot builds: it
sometimes times out while waiting for the secondary to catch up. Or,
at least I think it does, and I'm not too familiar with TAP failure
outputs: it returns with error code 29 and logs that I'd expect when
the timeout is reached.

See bottom for examples (all 3 builds for different patches).

Kind regards,

Matthias van de Meent.

[1]: https://cirrus-ci.com/task/4960990331666432?logs=test_world#L2631-L2662
[2]: https://cirrus-ci.com/task/5012678384025600?logs=test_world#L2631-L2662
[3]: https://cirrus-ci.com/task/5147001137397760?logs=test_world#L2631-L2662

#122

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Matthias van de Meent (#121)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

I'd like to bring to your attention that the test that was introduced
with 9e4f914b seems to be flaky in FreeBSD 13 in the CFBot builds: it
sometimes times out while waiting for the secondary to catch up. Or,
at least I think it does, and I'm not too familiar with TAP failure
outputs: it returns with error code 29 and logs that I'd expect when
the timeout is reached.

It's also failing in the buildfarm, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-07-28%2020%3A57%3A50

Looks like only conchuela so far, reinforcing the idea that we're
only seeing it on FreeBSD. I'd tentatively bet on a timing problem
that requires some FreeBSD scheduling quirk to manifest; we've seen
such quirks before.

regards, tom lane

#123

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#122)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Fri, Jul 29, 2022 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

I'd like to bring to your attention that the test that was introduced
with 9e4f914b seems to be flaky in FreeBSD 13 in the CFBot builds: it
sometimes times out while waiting for the secondary to catch up. Or,
at least I think it does, and I'm not too familiar with TAP failure
outputs: it returns with error code 29 and logs that I'd expect when
the timeout is reached.

It's also failing in the buildfarm, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-07-28%2020%3A57%3A50

Looks like only conchuela so far, reinforcing the idea that we're
only seeing it on FreeBSD. I'd tentatively bet on a timing problem
that requires some FreeBSD scheduling quirk to manifest; we've seen
such quirks before.

Maybe it just needs a replication slot? I see:

ERROR: requested WAL segment 000000010000000000000003 has already been removed
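
As a reading aid for the error above: a WAL segment file name is 24 hexadecimal digits, eight each for the timeline ID, the high part of the segment number, and the low part. The sketch below decodes such a name. It assumes the default 16 MB segment size (256 segments per 4 GB), and parse_wal_segment_name is a hypothetical helper, not a PostgreSQL function.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: default 16 MB WAL segments, i.e. 256 segments per 4 GB */
#define SEGMENTS_PER_XLOGID 256u

/*
 * Hypothetical helper: split a 24-hex-digit WAL file name into its timeline
 * ID and segment number.  Returns 0 on success, -1 if the name is malformed.
 */
static int
parse_wal_segment_name(const char *name, uint32_t *tli, uint64_t *segno)
{
    unsigned int t;
    unsigned int log;
    unsigned int seg;

    /* Exactly 24 upper-case hex digits are expected */
    if (strlen(name) != 24 ||
        strspn(name, "0123456789ABCDEF") != 24)
        return -1;

    if (sscanf(name, "%8X%8X%8X", &t, &log, &seg) != 3)
        return -1;

    *tli = t;
    *segno = (uint64_t) log * SEGMENTS_PER_XLOGID + seg;
    return 0;
}
```

Under that assumption, 000000010000000000000003 decodes to timeline 1, segment 3, which begins at LSN 0/3000000. The primary had already recycled that segment by the time the standby asked for it.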

#124

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Thomas Munro (#123)

1 attachment(s)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Fri, 29 Jul 2022 11:27:01 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

Maybe it just needs a replication slot? I see:

ERROR: requested WAL segment 000000010000000000000003 has already been removed

Agreed, I see the same. The same failure can be surely reproducible
by inserting wal-switch+checkpoint after taking backup [1]. And it is
fixed by the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

[1]:
--- a/src/test/recovery/t/033_replay_tsp_drops.pl
+++ b/src/test/recovery/t/033_replay_tsp_drops.pl
@@ -30,6 +30,13 @@ sub test_tablespace
 	my $backup_name = 'my_backup';
 	$node_primary->backup($backup_name);

+	$node_primary->psql(
+		'postgres',
+		qq[
+		CREATE TABLE t(); DROP TABLE t; SELECT pg_switch_wal();
+		CHECKPOINT;
+		]);
+
 	my $node_standby = PostgreSQL::Test::Cluster->new("standby2_$strategy");
 	$node_standby->init_from_backup($node_primary, $backup_name,
 		has_streaming => 1);

Attachments:

fix_tsp_drop_test_error.diff (text/x-patch; charset=us-ascii)

diff --git a/src/test/recovery/t/033_replay_tsp_drops.pl b/src/test/recovery/t/033_replay_tsp_drops.pl
index 9b74cb09ac..0756ca6c87 100644
--- a/src/test/recovery/t/033_replay_tsp_drops.pl
+++ b/src/test/recovery/t/033_replay_tsp_drops.pl
@@ -20,6 +20,7 @@ sub test_tablespace
 	$node_primary->psql(
 		'postgres',
 		qq[
+			SELECT pg_create_physical_replication_slot('slot1', true);
 			SET allow_in_place_tablespaces=on;
 			CREATE TABLESPACE dropme_ts1 LOCATION '';
 			CREATE TABLESPACE dropme_ts2 LOCATION '';

#125

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Kyotaro Horiguchi (#124)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-29, Kyotaro Horiguchi wrote:

At Fri, 29 Jul 2022 11:27:01 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

Maybe it just needs a replication slot? I see:

ERROR: requested WAL segment 000000010000000000000003 has already been removed

Agreed, I see the same. The same failure can be surely reproducible
by inserting wal-switch+checkpoint after taking backup [1]. And it is
fixed by the attached.

WFM, pushed that way. I added a slot drop after the pg_stat_replication
count check to be a little less intrusive. Thanks Matthias for
reporting. (Note that the Cirrus page has a download link for the
complete logs as artifacts).

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
(Linus Torvalds)

#126

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Alvaro Herrera (#125)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

WFM, pushed that way.

Looks like conchuela is still intermittently unhappy.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-07-30%2004%3A57%3A51

regards, tom lane

#127

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Tom Lane (#126)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

I wrote:

Looks like conchuela is still intermittently unhappy.

BTW, quite aside from stability, is it really necessary for this test to
be so freakin' slow? florican for instance reports

[12:43:38] t/025_stuck_on_old_timeline.pl ....... ok 49010 ms ( 0.00 usr 0.00 sys + 3.64 cusr 2.49 csys = 6.13 CPU)
[12:44:12] t/026_overwrite_contrecord.pl ........ ok 34751 ms ( 0.01 usr 0.00 sys + 3.14 cusr 1.76 csys = 4.91 CPU)
[12:49:00] t/027_stream_regress.pl .............. ok 287278 ms ( 0.00 usr 0.00 sys + 9.66 cusr 6.95 csys = 16.60 CPU)
[12:50:04] t/028_pitr_timelines.pl .............. ok 64543 ms ( 0.00 usr 0.00 sys + 3.59 cusr 3.20 csys = 6.78 CPU)
[12:50:17] t/029_stats_restart.pl ............... ok 12505 ms ( 0.02 usr 0.00 sys + 3.16 cusr 1.40 csys = 4.57 CPU)
[12:50:51] t/030_stats_cleanup_replica.pl ....... ok 33933 ms ( 0.01 usr 0.01 sys + 3.55 cusr 2.46 csys = 6.03 CPU)
[12:51:25] t/031_recovery_conflict.pl ........... ok 34249 ms ( 0.00 usr 0.00 sys + 3.37 cusr 2.20 csys = 5.57 CPU)
[12:52:09] t/032_relfilenode_reuse.pl ........... ok 44274 ms ( 0.01 usr 0.00 sys + 3.21 cusr 2.05 csys = 5.27 CPU)
[12:54:07] t/033_replay_tsp_drops.pl ............ ok 117840 ms ( 0.01 usr 0.00 sys + 8.72 cusr 5.41 csys = 14.14 CPU)

027 is so bloated because it runs the core regression tests YA time,
which I'm not very happy about either; but that's no excuse for
every new test to contribute an additional couple of minutes.

regards, tom lane

#128

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#127)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Sun, Jul 31, 2022 at 4:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, quite aside from stability, is it really necessary for this test to
be so freakin' slow? florican for instance reports

[12:43:38] t/025_stuck_on_old_timeline.pl ....... ok 49010 ms ( 0.00 usr 0.00 sys + 3.64 cusr 2.49 csys = 6.13 CPU)
[12:44:12] t/026_overwrite_contrecord.pl ........ ok 34751 ms ( 0.01 usr 0.00 sys + 3.14 cusr 1.76 csys = 4.91 CPU)
[12:49:00] t/027_stream_regress.pl .............. ok 287278 ms ( 0.00 usr 0.00 sys + 9.66 cusr 6.95 csys = 16.60 CPU)
[12:50:04] t/028_pitr_timelines.pl .............. ok 64543 ms ( 0.00 usr 0.00 sys + 3.59 cusr 3.20 csys = 6.78 CPU)
[12:50:17] t/029_stats_restart.pl ............... ok 12505 ms ( 0.02 usr 0.00 sys + 3.16 cusr 1.40 csys = 4.57 CPU)
[12:50:51] t/030_stats_cleanup_replica.pl ....... ok 33933 ms ( 0.01 usr 0.01 sys + 3.55 cusr 2.46 csys = 6.03 CPU)
[12:51:25] t/031_recovery_conflict.pl ........... ok 34249 ms ( 0.00 usr 0.00 sys + 3.37 cusr 2.20 csys = 5.57 CPU)
[12:52:09] t/032_relfilenode_reuse.pl ........... ok 44274 ms ( 0.01 usr 0.00 sys + 3.21 cusr 2.05 csys = 5.27 CPU)
[12:54:07] t/033_replay_tsp_drops.pl ............ ok 117840 ms ( 0.01 usr 0.00 sys + 8.72 cusr 5.41 csys = 14.14 CPU)

027 is so bloated because it runs the core regression tests YA time,
which I'm not very happy about either; but that's no excuse for
every new test to contribute an additional couple of minutes.

Complaints about 027 noted, I'm thinking about what we could do about that.

As for 033, I worried that it might be the new ProcSignalBarrier stuff
around tablespaces, but thankfully the DEBUG logging I added there
recently shows those all completing in single digit milliseconds. I
also confirmed there are no unexpected fsyncs being produced here.

That is quite a lot of CPU, but it's a huge amount of total runtime.
It runs in 5-8 seconds on various modern systems, 19 seconds on my
Linux RPi4, and 50 seconds on my Celeron-powered NAS box with spinning
disks.

I noticed this is a 32 bit FBSD system. Is it running on UFS, perhaps
on slow storage? Are soft updates enabled (visible as options in
output of "mount")? Without soft updates, a lot more file system ops
perform synchronous I/O, which really slows down our tests. In
general, UFS isn't as good as modern file systems at avoiding I/O for
short-lived files, and we set up and tear down a lot of them in our
testing. Another thing that makes a difference is to use a filesystem
with 8KB block size. This has been a subject of investigation for
speeding up CI (see src/tools/ci/gcp_freebsd_repartition.sh), but
several mysteries remain unsolved...

#129

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#128)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thomas Munro <thomas.munro@gmail.com> writes:

I noticed this is a 32 bit FBSD system. Is it running on UFS, perhaps
on slow storage? Are soft updates enabled (visible as options in
output of "mount")?

It's an ancient (2006) mac mini with 5400RPM spinning rust.
"mount" says

/dev/ada0s2a on / (ufs, local, soft-updates, journaled soft-updates)
devfs on /dev (devfs)

regards, tom lane

#130

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#126)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Sun, Jul 31, 2022 at 2:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

WFM, pushed that way.

Looks like conchuela is still intermittently unhappy.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-07-30%2004%3A57%3A51

And here's one from CI that failed on Linux (this was a cfbot run with
an unrelated patch, parent commit b998196 so a few commits after "Fix
test instability"):

https://cirrus-ci.com/task/5282155000496128

https://api.cirrus-ci.com/v1/artifact/task/5282155000496128/log/src/test/recovery/tmp_check/log/033_replay_tsp_drops_primary1_WAL_LOG.log

It looks like this sequence is racy and we need to wait for more than
just "connection is made" before dropping the slot?

$node_standby->start;

# Make sure connection is made
$node_primary->poll_query_until('postgres',
'SELECT count(*) = 1 FROM pg_stat_replication');
$node_primary->safe_psql('postgres',
"SELECT pg_drop_replication_slot('slot')");

Why not set the replication slot name so that the standby uses it
"properly", like in other tests?

#131

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#126)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi,

On 2022-07-30 10:37:55 -0400, Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

WFM, pushed that way.

Looks like conchuela is still intermittently unhappy.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-07-30%2004%3A57%3A51

CI as well:
https://cirrus-ci.com/task/5295464063959040?logs=test_world#L2671
https://cirrus-ci.com/task/5042590885085184?logs=test_world#L2664

Greetings,

Andres Freund

#132

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Thomas Munro (#130)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Sun, Jul 31, 2022 at 3:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Jul 31, 2022 at 2:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

WFM, pushed that way.

Looks like conchuela is still intermittently unhappy.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-07-30%2004%3A57%3A51

And here's one from CI that failed on Linux (this was a cfbot run with
an unrelated patch, parent commit b998196 so a few commits after "Fix
test instability"):

https://cirrus-ci.com/task/5282155000496128

https://api.cirrus-ci.com/v1/artifact/task/5282155000496128/log/src/test/recovery/tmp_check/log/033_replay_tsp_drops_primary1_WAL_LOG.log

It looks like this sequence is racy and we need to wait for more than
just "connection is made" before dropping the slot?

$node_standby->start;

# Make sure connection is made
$node_primary->poll_query_until('postgres',
    'SELECT count(*) = 1 FROM pg_stat_replication');
$node_primary->safe_psql('postgres',
    "SELECT pg_drop_replication_slot('slot')");

Why not set the replication slot name so that the standby uses it
"properly", like in other tests?

Or to keep doing it this way, does that pg_stat_replication query need
a WHERE clause looking at the state?
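
A minimal sketch of what such a WHERE clause might look like, assuming the
intent is to wait until the walsender reports that it is actually streaming.
The state column of pg_stat_replication is documented (its values include
startup, catchup and streaming); picking 'streaming' as the condition is an
assumption here, not something confirmed in this thread:

```sql
-- Hypothetical variant of the test's polling query: count the standby only
-- once its walsender has reached the 'streaming' state (assumed condition).
SELECT count(*) = 1
FROM pg_stat_replication
WHERE state = 'streaming';
```

This would be passed to poll_query_until in place of the unfiltered query.
Whether it closes the race would still need checking against the failing runs.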

#133

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#129)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Sun, Jul 31, 2022 at 11:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

I noticed this is a 32 bit FBSD system. Is it running on UFS, perhaps
on slow storage? Are soft updates enabled (visible as options in
output of "mount")?

It's an ancient (2006) mac mini with 5400RPM spinning rust.
"mount" says

/dev/ada0s2a on / (ufs, local, soft-updates, journaled soft-updates)
devfs on /dev (devfs)

I don't have all the details and I may be way off here but I have the
impression that when you create and then unlink trees of files
quickly, sometimes soft-updates are flushed synchronously, which turns
into many 5400 RPM seeks; dtrace could be used to check, but some
clues in your numbers would be some kind of correlation between time
and number of clusters that are set up and torn down by each test.
Without soft-updates, it'd be much worse, because then many more
things become synchronous I/O. Even with write caching enabled,
soft-updates flush the drive cache when there's a barrier needed for
crash safety. It may also be that there is something strange about
Apple hardware that makes it extra slow at full-cache-flush operations
(cf unexplainable excess slowness of F_FULLFSYNC under macOS including
old spinning rust systems and current flash systems, and complaints
about this general area on current Apple hardware from the Asahi
Linux/M1 port people, though how relevant that is to 2006 spinning
rust I dunno). It would be nice to look into how to tune, fix or work
around all of that, as it also affects CI, which has I/O limits
(though admittedly a couple of orders of magnitude more IOPS than a
5400 RPM drive).

#134

Alvaro Herrera

alvherre@alvh.no-ip.org

over 3 years ago

In reply to: Tom Lane (#127)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On 2022-Jul-30, Tom Lane wrote:

BTW, quite aside from stability, is it really necessary for this test to
be so freakin' slow? florican for instance reports

[12:54:07] t/033_replay_tsp_drops.pl ............ ok 117840 ms ( 0.01 usr 0.00 sys + 8.72 cusr 5.41 csys = 14.14 CPU)

027 is so bloated because it runs the core regression tests YA time,
which I'm not very happy about either; but that's no excuse for
every new test to contribute an additional couple of minutes.

Definitely not intended. It looks like the reason is just that the DROP
DATABASE/TABLESPACE commands are super slow, and this test does a lot of
that. I added some instrumentation and the largest fraction of time
goes to executing this:

CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
CREATE TABLE t (a int) TABLESPACE dropme_ts2;
CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
CREATE DATABASE moveme_db TABLESPACE source_ts;
ALTER DATABASE moveme_db SET TABLESPACE target_ts;
CREATE DATABASE newdb TEMPLATE template_db;
ALTER DATABASE template_db IS_TEMPLATE = false;
DROP DATABASE dropme_db1;
DROP TABLE t;
DROP DATABASE dropme_db2;
DROP TABLESPACE dropme_ts2;
DROP TABLESPACE source_ts;
DROP DATABASE template_db;

Maybe this is overkill and we can reduce the test without damaging the
coverage. I'll have a look during the weekend.

I'll repair the reliability problem too, separately.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"This is a foot just waiting to be shot" (Andrew Dunstan)