replay of CREATE TABLESPACE eats data at wal_level=minimal

Started by Robert Haasover 4 years ago13 messages
#1Robert Haas
robertmhaas@gmail.com

To reproduce, initialize a cluster with wal_level=minimal and
max_wal_senders=0. Then from psql:

\! mkdir /tmp/goose

CHECKPOINT;
CREATE TABLESPACE goose LOCATION '/tmp/goose';
SET wal_skip_threshold=0;
BEGIN;
CREATE TABLE wild (a int, b text) TABLESPACE goose;
INSERT INTO wild VALUES (1, 'chase');
COMMIT;
SELECT * FROM wild;

As expected, you will see one row in table 'wild'. Now perform an
immediate shutdown. Restart the server. Table 'wild' is now empty. The
problem appears to be that tblspc_redo() calls
create_tablespace_directories(), which says:

/*
* Our theory for replaying a CREATE is to forcibly drop the target
* subdirectory if present, and then recreate it. This may be more
* work than needed, but it is simple to implement.
*/

Unfortunately, this theory (which dates to
c86f467d18aa58e18fd85b560b46d8de014e6017, vintage 2010, by Bruce) is
correct only with wal_level>minimal. At wal_level='minimal', we can
replay the record to recreate the relfilenode, but not the records
that would have created the contents. However, note that if the table
is smaller than wal_skip_threshold, then we'll log full-page images of
the contents at commit time even at wal_level='minimal' after which we
have no problem. As far as I can see, this bug has "always" existed,
but before c6b92041d38512a4176ed76ad06f713d2e6c01a8 (2020, Noah) you
would have needed a different test case. Specifically, you would have
needed to use COPY to put the row in the table, and you would have
needed to omit setting wal_skip_threshold since it didn't exist yet.

I don't presently have a specific idea about how to fix this.

--
Robert Haas
EDB: http://www.enterprisedb.com

#2Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#1)
1 attachment(s)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Mon, Aug 09, 2021 at 01:08:42PM -0400, Robert Haas wrote:

To reproduce, initialize a cluster with wal_level=minimal and
max_wal_senders=0. Then from psql:

\! mkdir /tmp/goose

CHECKPOINT;
CREATE TABLESPACE goose LOCATION '/tmp/goose';
SET wal_skip_threshold=0;
BEGIN;
CREATE TABLE wild (a int, b text) TABLESPACE goose;
INSERT INTO wild VALUES (1, 'chase');
COMMIT;
SELECT * FROM wild;

As expected, you will see one row in table 'wild'. Now perform an
immediate shutdown. Restart the server. Table 'wild' is now empty.

Thanks for finding the problem. It's a bad problem.

The problem appears to be that tblspc_redo() calls
create_tablespace_directories(), which says:

/*
* Our theory for replaying a CREATE is to forcibly drop the target
* subdirectory if present, and then recreate it. This may be more
* work than needed, but it is simple to implement.
*/

Unfortunately, this theory (which dates to
c86f467d18aa58e18fd85b560b46d8de014e6017, vintage 2010, by Bruce) is
correct only with wal_level>minimal. At wal_level='minimal', we can
replay the record to recreate the relfilenode, but not the records
that would have created the contents. However, note that if the table
is smaller than wal_skip_threshold, then we'll log full-page images of
the contents at commit time even at wal_level='minimal' after which we
have no problem. As far as I can see, this bug has "always" existed,
but before c6b92041d38512a4176ed76ad06f713d2e6c01a8 (2020, Noah) you
would have needed a different test case. Specifically, you would have
needed to use COPY to put the row in the table, and you would have
needed to omit setting wal_skip_threshold since it didn't exist yet.

Agreed.

I don't presently have a specific idea about how to fix this.

Can't recovery just not delete the directory, create it if doesn't exist, and
be happy if it does exist? Like the attached WIP. If we think it's possible
for a crash during mkdir to leave a directory having the wrong permissions,
adding a chmod would be in order.

Attachments:

XLOG_TBLSPC_CREATE-minimal-v0.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index a54239a..2928ecf 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -589,7 +589,6 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
 {
 	char	   *linkloc;
 	char	   *location_with_version_dir;
-	struct stat st;
 
 	linkloc = psprintf("pg_tblspc/%u", tablespaceoid);
 	location_with_version_dir = psprintf("%s/%s", location,
@@ -614,39 +613,22 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
 							location)));
 	}
 
-	if (InRecovery)
-	{
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, and then recreate it. This may be more
-		 * work than needed, but it is simple to implement.
-		 */
-		if (stat(location_with_version_dir, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(location_with_version_dir, true))
-				/* If this failed, MakePGDirectory() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								location_with_version_dir)));
-		}
-	}
-
 	/*
 	 * The creation of the version directory prevents more than one tablespace
 	 * in a single location.
 	 */
 	if (MakePGDirectory(location_with_version_dir) < 0)
 	{
-		if (errno == EEXIST)
-			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_IN_USE),
-					 errmsg("directory \"%s\" already in use as a tablespace",
-							location_with_version_dir)));
-		else
+		if (errno != EEXIST)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not create directory \"%s\": %m",
 							location_with_version_dir)));
+		else if (!InRecovery)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_IN_USE),
+					 errmsg("directory \"%s\" already in use as a tablespace",
+							location_with_version_dir)));
 	}
 
 	/*
#3Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#2)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Mon, Aug 9, 2021 at 9:23 PM Noah Misch <noah@leadboat.com> wrote:

I don't presently have a specific idea about how to fix this.

Can't recovery just not delete the directory, create it if doesn't exist, and
be happy if it does exist? Like the attached WIP. If we think it's possible
for a crash during mkdir to leave a directory having the wrong permissions,
adding a chmod would be in order.

Oh, yeah, I think that works, actually. I was imagining a few problems
here, but I don't think they really exist. The redo routines for files
within the directory can't possibly care about having the old files
erased for them, since that wouldn't be something that would normally
happen, if there were no recent CREATE TABLESPACE involved. And
there's code further down to remove and recreate the symlink, just in
case. So I think your proposed patch might be all we need.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#3)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Tue, Aug 10, 2021 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote:

Oh, yeah, I think that works, actually. I was imagining a few problems
here, but I don't think they really exist. The redo routines for files
within the directory can't possibly care about having the old files
erased for them, since that wouldn't be something that would normally
happen, if there were no recent CREATE TABLESPACE involved. And
there's code further down to remove and recreate the symlink, just in
case. So I think your proposed patch might be all we need.

Noah, do you plan to commit this?

Does anyone else want to review?

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#4)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Wed, Aug 18, 2021 at 10:47:24AM -0400, Robert Haas wrote:

On Tue, Aug 10, 2021 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote:

Oh, yeah, I think that works, actually. I was imagining a few problems
here, but I don't think they really exist. The redo routines for files
within the directory can't possibly care about having the old files
erased for them, since that wouldn't be something that would normally
happen, if there were no recent CREATE TABLESPACE involved. And
there's code further down to remove and recreate the symlink, just in
case. So I think your proposed patch might be all we need.

Noah, do you plan to commit this?

Yes. I feel it needs a test case, which is the main reason I've queued the
task rather than just pushed what I posted last.

On Mon, Aug 09, 2021 at 06:23:07PM -0700, Noah Misch wrote:

If we think it's possible for a crash during mkdir to leave a directory
having the wrong permissions, adding a chmod would be in order.

I wouldn't be surprised if it's possible under some NFS versions, considering
behavior like https://www.spinics.net/lists/linux-nfs/msg59044.html exists.
However, nowhere else do we try to cope with this. Hence, the fix for
$SUBJECT shouldn't change that pattern.

#6Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#5)
1 attachment(s)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Wed, Aug 18, 2021 at 10:32:10PM -0700, Noah Misch wrote:

On Wed, Aug 18, 2021 at 10:47:24AM -0400, Robert Haas wrote:

On Tue, Aug 10, 2021 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote:

Oh, yeah, I think that works, actually. I was imagining a few problems
here, but I don't think they really exist. The redo routines for files
within the directory can't possibly care about having the old files
erased for them, since that wouldn't be something that would normally
happen, if there were no recent CREATE TABLESPACE involved. And
there's code further down to remove and recreate the symlink, just in
case. So I think your proposed patch might be all we need.

Noah, do you plan to commit this?

Yes. I feel it needs a test case, which is the main reason I've queued the
task rather than just pushed what I posted last.

Here's what I plan to push. Besides adding a test, I modified things so
CREATE TABLESPACE redo continues to report an error if a non-directory exists
under the name we seek to create. One could argue against covering that
corner case, but TablespaceCreateDbspace() does cover it.

Attachments:

XLOG_TBLSPC_CREATE-minimal-v1.patchtext/plain; charset=us-asciiDownload
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Fix data loss in wal_level=minimal crash recovery of CREATE TABLESPACE.
    
    If the system crashed between CREATE TABLESPACE and the next checkpoint,
    the result could be some files in the tablespace unexpectedly containing
    no rows.  Affected files would be those for which the system did not
    write WAL; see the wal_skip_threshold documentation.  Before v13, a
    different set of conditions governed the writing of WAL; see v12's
    <sect2 id="populate-pitr">.  Users may want to audit non-default
    tablespaces for unexpected short files.  This bug could have truncated
    an index without affecting the associated table, and reindexing the
    index would fix that particular problem.  This fixes the bug by making
    create_tablespace_directories() more like TablespaceCreateDbspace().
    Back-patch to 9.6 (all supported versions).
    
    Reviewed by Robert Haas.
    
    Discussion: https://postgr.es/m/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zKiHxQzBbFRxxGxm0w@mail.gmail.com

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index a54239a..932e7ae 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -614,40 +614,36 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
 							location)));
 	}
 
-	if (InRecovery)
-	{
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, and then recreate it. This may be more
-		 * work than needed, but it is simple to implement.
-		 */
-		if (stat(location_with_version_dir, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(location_with_version_dir, true))
-				/* If this failed, MakePGDirectory() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								location_with_version_dir)));
-		}
-	}
-
 	/*
 	 * The creation of the version directory prevents more than one tablespace
-	 * in a single location.
+	 * in a single location.  This imitates TablespaceCreateDbspace(), but it
+	 * ignores concurrency and missing parent directories.  The chmod() would
+	 * have failed in the absence of a parent.  pg_tablespace_spcname_index
+	 * prevents concurrency.
 	 */
-	if (MakePGDirectory(location_with_version_dir) < 0)
+	if (stat(location_with_version_dir, &st) < 0)
 	{
-		if (errno == EEXIST)
+		if (errno != ENOENT)
 			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_IN_USE),
-					 errmsg("directory \"%s\" already in use as a tablespace",
+					(errcode_for_file_access(),
+					 errmsg("could not stat directory \"%s\": %m",
 							location_with_version_dir)));
-		else
+		else if (MakePGDirectory(location_with_version_dir) < 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not create directory \"%s\": %m",
 							location_with_version_dir)));
 	}
+	else if (!S_ISDIR(st.st_mode))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" exists but is not a directory",
+						location_with_version_dir)));
+	else if (!InRecovery)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+				 errmsg("directory \"%s\" already in use as a tablespace",
+						location_with_version_dir)));
 
 	/*
 	 * In recovery, remove old symlink, in case it points to the wrong place.
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index 4aa1bf8..47cbc95 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -14,7 +14,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 34;
+use Test::More tests => 38;
 
 sub check_orphan_relfilenodes
 {
@@ -59,8 +59,31 @@ wal_skip_threshold = 0
 	my $tablespace_dir = $node->basedir . '/tablespace_other';
 	mkdir($tablespace_dir);
 	$tablespace_dir = TestLib::perl2host($tablespace_dir);
-	$node->safe_psql('postgres',
-		"CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+	my $result;
+
+	# Test redo of CREATE TABLESPACE.
+	$node->safe_psql(
+		'postgres', "
+		CREATE TABLE moved (id int);
+		INSERT INTO moved VALUES (1);
+		CREATE TABLESPACE other LOCATION '$tablespace_dir';
+		BEGIN;
+		ALTER TABLE moved SET TABLESPACE other;
+		CREATE TABLE originated (id int);
+		INSERT INTO originated VALUES (1);
+		CREATE UNIQUE INDEX ON originated(id) TABLESPACE other;
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM moved;");
+	is($result, qq(1), "wal_level = $wal_level, CREATE+SET TABLESPACE");
+	$result = $node->safe_psql(
+		'postgres', "
+		INSERT INTO originated VALUES (1) ON CONFLICT (id)
+		  DO UPDATE set id = originated.id + 1
+		  RETURNING id;");
+	is($result, qq(2),
+		"wal_level = $wal_level, CREATE TABLESPACE, CREATE INDEX");
 
 	# Test direct truncation optimization.  No tuples.
 	$node->safe_psql(
@@ -71,7 +94,7 @@ wal_skip_threshold = 0
 		COMMIT;");
 	$node->stop('immediate');
 	$node->start;
-	my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
 	is($result, qq(0), "wal_level = $wal_level, TRUNCATE with empty table");
 
 	# Test truncation with inserted tuples within the same transaction.
#7Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Noah Misch (#6)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Mon, Aug 23, 2021 at 4:29 AM Noah Misch <noah@leadboat.com> wrote:

On Wed, Aug 18, 2021 at 10:32:10PM -0700, Noah Misch wrote:

On Wed, Aug 18, 2021 at 10:47:24AM -0400, Robert Haas wrote:

On Tue, Aug 10, 2021 at 9:35 AM Robert Haas <robertmhaas@gmail.com>

wrote:

Oh, yeah, I think that works, actually. I was imagining a few

problems

here, but I don't think they really exist. The redo routines for

files

within the directory can't possibly care about having the old files
erased for them, since that wouldn't be something that would normally
happen, if there were no recent CREATE TABLESPACE involved. And
there's code further down to remove and recreate the symlink, just in
case. So I think your proposed patch might be all we need.

Noah, do you plan to commit this?

Yes. I feel it needs a test case, which is the main reason I've queued

the

task rather than just pushed what I posted last.

Here's what I plan to push. Besides adding a test,

I have reproduced the issue of data inconsistency with CREATE TABLESPACE at
wal_level=minimal,
also I have tested the fix with v0 and v1 patch, and come up with a similar
tap-testcase(as in v1). The test case looks good.

--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

#8Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#6)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Sun, Aug 22, 2021 at 6:59 PM Noah Misch <noah@leadboat.com> wrote:

Here's what I plan to push. Besides adding a test, I modified things so
CREATE TABLESPACE redo continues to report an error if a non-directory exists
under the name we seek to create. One could argue against covering that
corner case, but TablespaceCreateDbspace() does cover it.

By and large, LGTM, though perhaps it would be appropriate to also
credit me as the reporter of the issue.

I feel it might be slightly better to highlight somewhere, either in
the commit message or in the comments, that removing the old directory
is unsafe, because if wal_level=minimal, we may have no other copy of
the data. For me that's the key point here. I feel that the commit
message and comments inside the patch explain rather thoroughly the
possible consequences of the bug and why this particular fix was
chosen, but they're not real explicit about why there was a bug at
all.

Thanks very much for working on this.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#8)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Mon, Aug 23, 2021 at 09:08:44AM -0400, Robert Haas wrote:

On Sun, Aug 22, 2021 at 6:59 PM Noah Misch <noah@leadboat.com> wrote:

Here's what I plan to push. Besides adding a test, I modified things so
CREATE TABLESPACE redo continues to report an error if a non-directory exists
under the name we seek to create. One could argue against covering that
corner case, but TablespaceCreateDbspace() does cover it.

By and large, LGTM, though perhaps it would be appropriate to also
credit me as the reporter of the issue.

I feel it might be slightly better to highlight somewhere, either in
the commit message or in the comments, that removing the old directory
is unsafe, because if wal_level=minimal, we may have no other copy of
the data. For me that's the key point here. I feel that the commit
message and comments inside the patch explain rather thoroughly the
possible consequences of the bug and why this particular fix was
chosen, but they're not real explicit about why there was a bug at
all.

Sounds good. I think the log message is the optimal place:

===
Fix data loss in wal_level=minimal crash recovery of CREATE TABLESPACE.

If the system crashed between CREATE TABLESPACE and the next checkpoint,
the result could be some files in the tablespace unexpectedly containing
no rows. Affected files would be those for which the system did not
write WAL; see the wal_skip_threshold documentation. Before v13, a
different set of conditions governed the writing of WAL; see v12's
<sect2 id="populate-pitr">. (The v12 conditions were broader in some
ways and narrower in others.) Users may want to audit non-default
tablespaces for unexpected short files. The bug could have truncated an
index without affecting the associated table, and reindexing the index
would fix that particular problem.

This fixes the bug by making create_tablespace_directories() more like
TablespaceCreateDbspace(). create_tablespace_directories() was
recursively removing tablespace contents, reasoning that WAL redo would
recreate everything removed that way. That assumption holds for other
wal_level values. Under wal_level=minimal, the old approach could
delete files for which no other copy existed. Back-patch to 9.6 (all
supported versions).

Reviewed by Robert Haas and Prabhat Sahu. Reported by Robert Haas.

Discussion: /messages/by-id/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zKiHxQzBbFRxxGxm0w@mail.gmail.com
===

#10Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#9)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Wed, Aug 25, 2021 at 1:21 AM Noah Misch <noah@leadboat.com> wrote:

Sounds good. I think the log message is the optimal place:

Looks awesome.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#10)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Wed, Aug 25, 2021 at 8:03 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 25, 2021 at 1:21 AM Noah Misch <noah@leadboat.com> wrote:

Sounds good. I think the log message is the optimal place:

Looks awesome.

Is there anything still standing in the way of committing this?

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#11)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Thu, Sep 02, 2021 at 11:28:27AM -0400, Robert Haas wrote:

On Wed, Aug 25, 2021 at 8:03 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 25, 2021 at 1:21 AM Noah Misch <noah@leadboat.com> wrote:

Sounds good. I think the log message is the optimal place:

Looks awesome.

Is there anything still standing in the way of committing this?

I pushed it as commit 97ddda8.

#13Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#12)
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

On Thu, Sep 2, 2021 at 10:56 PM Noah Misch <noah@leadboat.com> wrote:

Is there anything still standing in the way of committing this?

I pushed it as commit 97ddda8.

Oh, thanks. Sorry, I had missed that.

--
Robert Haas
EDB: http://www.enterprisedb.com