standby recovery fails (tablespace related) (tentative patch and discussion)

apraveen@pivotal.io

about 7 years ago

In reply to: Paul Guo (#1)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

Asim

pguo@pivotal.io

about 7 years ago

In reply to: Asim R P (#2)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, a checkpoint is performed for drop_db/create_db/drop_tablespace
on the master. That makes the files and directories up to date, if I
understand the related code correctly.
On a standby, checkpoint redo does not ensure that.

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would

The 'create db with tablespace' redo record does not include the tablespace's
real directory information.
Yes, we could add it to the xlog, but that seems to be overdesign.

prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

The 'invalid page' mechanism seems better suited to missing pages of a
file. For missing directories we could, of course, hack it to work (e.g. by
reading any page of a relfile in that database) to make the tablespace
create code (without symlink) safer (it assumes those directories will be
deleted soon).

More feedback on all of the previously discussed solutions is welcome.

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Paul Guo (#3)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.

That's partly right. As you must have seen, a fast shutdown forces a
restartpoint for the last checkpoint, which prevents the problem
from happening. Anyway, it seems to be a problem.

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would

'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.

But I don't think creating a directory that is about to be removed
is a desirable solution. The directory will most likely be
removed right afterwards.

prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g. reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).

More feedback about all of the previous discussed solutions is welcome.

It doesn't seem to me that the invalid page mechanism is
applicable in a straightforward way, because it doesn't consider
simple file copy.

Drop failure is always ignored. I suppose we can ignore the
error and continue recovering as long as recovery has not reached
consistency. The attached would work *at least* for your case, but I
haven't checked whether it covers all places that need the same
treatment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Kyotaro Horiguchi (#4)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Oops! The comment in the previous patch is wrong.

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.

That's right partly. As you must have seen, fast shutdown forces
restartpoint for the last checkpoint and it prevents the problem
from happening. Anyway it seems to be a problem.

Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would

'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.

But I don't think creating directory that is to be removed just
after is a wanted solution. The directory most likely to be be
removed just after.

prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g. reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).

More feedback about all of the previous discussed solutions is welcome.

It doesn't seem to me that the invalid page mechanism is
applicable in straightforward way, because it doesn't consider
simple file copy.

Drop failure is ignored any time. I suppose we can ignore the
error to continue recovering as far as recovery have not reached
consistency. The attached would work *at least* your case, but I
haven't checked this covers all places where need the same
treatment.

The comment for the new function XLogMakePGDirectory is wrong:

+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.

The correct one is:

+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.

It is fixed in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Kyotaro Horiguchi (#5)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Mon, 22 Apr 2019 16:40:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.164027.33866403.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>

Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:

On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:

create db with tablespace
drop database
drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?

No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.

The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

michael@paquier.xyz

about 7 years ago

In reply to: Kyotaro Horiguchi (#6)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:

The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

+    # Check for symlink -- needed only on source dir
+    # (note: this will fall through quietly if file is already gone)
+    if (-l $srcpath)
+    {
+        croak "Cannot operate on symlink \"$srcpath\""
+          if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+        # We have mapped tablespaces. Copy them individually
+        my $linkname = $1;
+        my $tmpdir = TestLib::tempdir;
+        my $dstrealdir = TestLib::real_dir($tmpdir);
+        my $srcrealdir = readlink($srcpath);
+
+        opendir(my $dh, $srcrealdir);
+        while (readdir $dh)
+        {
+            next if (/^\.\.?$/);
+            my $spath = "$srcrealdir/$_";
+            my $dpath = "$dstrealdir/$_";
+
+            copypath($spath, $dpath);
+        }
+        closedir $dh;
+
+        symlink $dstrealdir, $destpath;
+        return 1;
+    }

The same stuff is proposed here:
/messages/by-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@mail.gmail.com

So there is a lot of demand for making the recursive copy more skilled
at handling symlinks for tablespace tests, and I'd like to propose
doing something along those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here? I am not sure
yet which one of the proposals is better than the other though.
--
Michael

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Michael Paquier (#7)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

At Tue, 23 Apr 2019 11:34:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20190423023438.GH2712@paquier.xyz>

On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:

The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

+    # Check for symlink -- needed only on source dir
+    # (note: this will fall through quietly if file is already gone)
+    if (-l $srcpath)
+    {
+        croak "Cannot operate on symlink \"$srcpath\""
+          if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+        # We have mapped tablespaces. Copy them individually
+        my $linkname = $1;
+        my $tmpdir = TestLib::tempdir;
+        my $dstrealdir = TestLib::real_dir($tmpdir);
+        my $srcrealdir = readlink($srcpath);
+
+        opendir(my $dh, $srcrealdir);
+        while (readdir $dh)
+        {
+            next if (/^\.\.?$/);
+            my $spath = "$srcrealdir/$_";
+            my $dpath = "$dstrealdir/$_";
+
+            copypath($spath, $dpath);
+        }
+        closedir $dh;
+
+        symlink $dstrealdir, $destpath;
+        return 1;
+    }

The same stuff is proposed here:
/messages/by-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@mail.gmail.com

So there is a lot of demand for making the recursive copy more skilled
at handling symlinks for tablespace tests, and I'd like to propose to
do something among those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here? I am not sure
yet which one of the proposals is better than the other though.

TBH I don't like that one (mine, cited above) very much. However, I'd
prefer to have it in v12 if this is a bug that is to be fixed in
v12. Otherwise we won't add a test for this later :p

Anyway, I'll take a look at that thread. Thanks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

pguo@pivotal.io

about 7 years ago

In reply to: Kyotaro Horiguchi (#5)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.


#10

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Paul Guo (#9)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The symlink to the removed tablespace has already been removed
in that case. Hence the server innocently creates files directly under
pg_tblspc, not in the tablespace. Finally, all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts by recalling a page in a file that was
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above), and the files are
removed later in recovery in the same way as above. This case doesn't
cause FATAL/PANIC during recovery, even on master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.
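
For illustration, the "create the target file if it doesn't already exist" behavior quoted from xlogutils.c can be approximated outside PostgreSQL with a plain open() call. This is only a minimal sketch: open_or_create() is a hypothetical helper name, not a PostgreSQL function, and the real code goes through smgr rather than calling open() directly.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a PostgreSQL function): open a file for
 * read/write, creating it empty if it does not exist yet, so that a
 * replayed write can proceed even if the file is deleted later in the log.
 * Returns a file descriptor, or -1 with errno set.
 */
int
open_or_create(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}
```

Note that open() with O_CREAT still fails with ENOENT when the parent directory is missing, which is the directory case this thread is about.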

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Kyotaro Horiguchi (#10)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Mmm. I posted to the wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resource managers that work on buffers are not affected.

SMGR:
- Both CREATE and TRUNCATE seem fine.

TBLSPC:
- We don't nest tablespace directories. No problem.

I didn't find a similar case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12

pguo@pivotal.io

about 7 years ago

In reply to: Kyotaro Horiguchi (#11)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <
20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in

<CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.

SMGR:
- Both CREATE and TRUNCATE seems fine.

TBLSPC:
- We don't nest tablespace directories. No Problem.

I don't find a similar case.

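I took some time to dig into the related code. It seems that ignoring the
failure when the dst directory cannot be created directly should be fine,
since the smgr redo code eventually creates the tablespace path by
calling TablespaceCreateDbspace().
What's more, I found some more issues.

1) The error message below is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be fixed simply by following
the code logic in GetDatabasePath().

2) It seems that the src directory could be missing, in which case
dbase_redo()->copydir() could error out. For example,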

\! rm -rf /tmp/tbspace1
\! mkdir /tmp/tbspace1
\! rm -rf /tmp/tbspace2
\! mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say the standby finishes all replay but the redo LSN in pg_control
still points at 'alter database', and then postgres is killed. In theory,
on startup dbase_redo()->copydir() will ERROR, since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. The simple code change
below could fix that.

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
         */
        FlushDatabaseBuffers(xlrec->src_db_id);

+       /*
+        * It is possible that the source directory is missing if
+        * we are re-replaying the xlog while subsequent xlogs
+        * drop the tablespace in previous replaying. For this
+        * we just skip.
+        */
+       if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+           return;
+
        /*
         * Copy this subdirectory to the new location
         *

If we want to fix the issue by ignoring the dst path creation failure, I do
not think we should do that in copydir(), since copydir() seems to be a
common function. I'm not sure whether it is used by some extensions or not.
If not, maybe we should move the dst path creation logic out of copydir().

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc. Even if it does not fix a bug, the pg_mkdir_p()
change seems more graceful and simpler.
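
As a rough illustration of the "mkdir -p" behavior being suggested, here is a standalone sketch. make_dirs() is a hypothetical helper written for this note; it is not PostgreSQL's pg_mkdir_p() and may differ from it in details such as error handling and path length limits.

```c
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not PostgreSQL's pg_mkdir_p): create "path" and any
 * missing parent directories, like "mkdir -p".  Returns 0 on success, or
 * -1 with errno set on failure.
 */
int
make_dirs(const char *path, mode_t mode)
{
    char        buf[4096];      /* arbitrary limit chosen for this sketch */
    size_t      len = strlen(path);
    struct stat st;

    if (len == 0 || len >= sizeof(buf))
    {
        errno = (len == 0) ? ENOENT : ENAMETOOLONG;
        return -1;
    }
    memcpy(buf, path, len + 1);

    /* Create each intermediate component, left to right. */
    for (char *p = buf + 1; *p != '\0'; p++)
    {
        if (*p != '/')
            continue;
        *p = '\0';
        if (mkdir(buf, mode) != 0 && errno != EEXIST)
            return -1;
        *p = '/';
    }

    /* Create the final component; an existing directory is not an error. */
    if (mkdir(buf, mode) != 0)
    {
        if (errno != EEXIST)
            return -1;
        if (stat(buf, &st) != 0)
            return -1;
        if (!S_ISDIR(st.st_mode))
        {
            errno = ENOTDIR;
            return -1;
        }
    }
    return 0;
}
```

Calling make_dirs("a/b/c", 0700) twice in a row succeeds both times, which is the property that makes this style convenient when a redo record is replayed more than once.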

Whether we ignore the mkdir failure or use mkdir_p, I find that these steps
are error-prone as postgres evolves, since they are hard to test and it is
not easy for us to think of all the potential bad cases. Would it be
possible to do a real checkpoint (flush & update redo lsn) when we see the
checkpoint xlogs for these operations? This would slow down the standby,
but the master also does this, and these operations are not common; in
particular, it seems that it would usually not slow down WAL receiving?

#13

pguo@pivotal.io

about 7 years ago

In reply to: Paul Guo (#12)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

I updated the original patch to

1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
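
The check described in item 1 might look roughly like the sketch below. Both helper names are hypothetical and the attached patch may be structured differently; the sketch also assumes plain '/'-separated paths without a trailing slash.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: true if "path" exists and is a directory. */
bool
is_directory(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Hypothetical helper: decide whether a replayed directory copy can be
 * skipped because the source directory, or the parent of the destination,
 * no longer exists.
 */
bool
can_skip_copy(const char *src, const char *dst)
{
    char        parent[4096];   /* arbitrary limit chosen for this sketch */
    const char *slash = strrchr(dst, '/');
    size_t      plen;

    if (!is_directory(src))
        return true;            /* nothing left to copy from */

    if (slash == NULL || slash == dst)
        return false;           /* parent is the current directory or "/" */

    plen = (size_t) (slash - dst);
    if (plen >= sizeof(parent))
        return false;           /* too long to check; let the copy report it */

    memcpy(parent, dst, plen);
    parent[plen] = '\0';
    return !is_directory(parent);
}
```

Keeping the decision in a small predicate next to the caller, rather than inside a shared copy routine, matches the concern raised earlier about not changing copydir() itself.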

I was not familiar with the TAP test details previously. I learned a lot
about how to test such cases from Kyotaro's patch series. 👍

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:


On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <
20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote

in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>

Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.

SMGR:
- Both CREATE and TRUNCATE seems fine.

TBLSPC:
- We don't nest tablespace directories. No Problem.

I don't find a similar case.

I spent some time digging into the related code. It seems that ignoring
a failure to create the dst directory should be fine, since the smgr redo
code eventually creates the tablespace path by calling
TablespaceCreateDbspace(). What's more, I found some more issues.

1) The below error message is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be fixed simply by following
the code logic in GetDatabasePath().

2) It seems that the src directory could be missing, in which case
dbase_redo()->copydir() could error out. For example,

\! rm -rf /tmp/tbspace1
\! mkdir /tmp/tbspace1
\! rm -rf /tmp/tbspace2
\! mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Suppose the standby finishes all replay but the redo lsn in pg_control is
still at the point of 'alter database', and then postgres is killed. In
theory, on startup, dbase_redo()->copydir() will ERROR, since 'drop
tablespace tbs1' has already removed the directories (and symlink) of tbs1.
The simple code change below could fix that.
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
         */
        FlushDatabaseBuffers(xlrec->src_db_id);
+       /*
+        * It is possible that the source directory is missing if
+        * we are re-replaying the xlog while subsequent xlogs
+        * drop the tablespace in previous replaying. For this
+        * we just skip.
+        */
+       if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+           return;
+
        /*
         * Copy this subdirectory to the new location
         *
If we want to fix the issue by ignoring the dst path creation failure, I do
not think we should do that in copydir(), since copydir() seems to be a
common function. I'm not sure whether it is used by some extensions or not.
If not, maybe we should move the dst path creation logic out of copydir().

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() change seems more graceful and simpler.

Whether we ignore the mkdir failure or use mkdir_p, I found that these steps
seem to be error-prone as Postgres evolves, since they are hard to test and
it is not easy to think of all the potential bad cases. Is it possible that
we should do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby, but
master also does this and these operations are not usual; in particular, it
seems that it does not usually slow down WAL receiving?

#14

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Paul Guo (#13)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hi.

At Tue, 30 Apr 2019 14:33:47 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGhmDKrq7JJu2rLLqcJBR8pA4OYrKsirZ5Ft8-deG1e8A@mail.gmail.com>

I updated the original patch to

It's reasonable not to touch copydir.

1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.

Nice catch.

+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {

This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.

+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));

This message seems unfriendly to users, or it reads like an elog
message. How about something like this? The same can be said for
the source directory.

| WARNING: skipped creating database directory: "%s"
| DETAIL: The tablespace %u may have been removed just before crash.

# I'm not confident in this at all:(

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

I was not familiar with the TAP test details before. I learned a lot
about how to test such cases from Kyotaro's patch series. 👍

Yeah, good to hear.

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:

If we want to fix the issue by ignoring the dst path creation failure, I do
not think we should do that in copydir(), since copydir() seems to be a
common function. I'm not sure whether it is used by some extensions or not.
If not, maybe we should move the dst path creation logic out of copydir().

Agreed to this.

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

Whether we ignore the mkdir failure or use mkdir_p, I found that these steps
seem to be error-prone as Postgres evolves, since they are hard to test and
it is not easy to think of all the potential bad cases. Is it possible that
we should do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby, but
master also does this and these operations are not usual; in particular, it
seems that it does not usually slow down WAL receiving?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15

pguo@pivotal.io

about 7 years ago

In reply to: Kyotaro Horiguchi (#14)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.

I do not understand this. Can you elaborate?

+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));
This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING: skipped creating database directory: "%s"
| DETAIL: The tablespace %u may have been removed just before crash.

Yeah. Looks better.

# I'm not confident in this at all:(

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

Whether we ignore the mkdir failure or use mkdir_p, I found that these steps
seem to be error-prone as Postgres evolves, since they are hard to test and
it is not easy to think of all the potential bad cases. Is it possible that
we should do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby, but
master also does this and these operations are not usual; in particular, it
seems that it does not usually slow down WAL receiving?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on master & standby
from checkpoint/restartpoint.
We do not worry about master, so we should not worry about standby either.

#16

horikyota.ntt@gmail.com

about 7 years ago

In reply to: Paul Guo (#15)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>

Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
I do not understand this. Can you elaborate?

Suppose we were recovering based on a backup at LSN1, targeting
LSN3, and then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called the "consistency point", before which the database is
not consistent. That is because, in the second trial, we are applying
WAL records older than those that were already applied. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint's redo LSN and LSN2=LSN3 is the crashed LSN.

Creation of an existing directory or dropping of a non-existent
directory is apparently inconsistent or "broken", so we should
stop recovery when seeing such WAL records while the database is in
a consistent state.

+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));
This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING: skipped creating database directory: "%s"
| DETAIL: The tablespace %u may have been removed just before crash.
Yeah. Looks better.

# I'm not confident in this at all:(

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

The original description is right in light of how the server
sees it. The record says exactly "copy dir 1663/1 to
65546/65547", and the latter path is resolved at the filesystem layer
via a symlink.

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

We don't want to create the tablespace directory after a concurrent
DROP, as the comment just above says:

| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.

If the concurrent DROP TABLESPACE destroyed the grandparent
directory, we mustn't create it again.
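
The difference from 'mkdir -p' can be illustrated with a directory-creation helper that refuses to recreate ancestors beyond a fixed depth. This is a minimal sketch, assuming a POSIX system; mkdir_limited() is a hypothetical name and is not how TablespaceCreateDbspace() is actually written. It does not handle trailing slashes, and it treats an existing non-directory at the path as success, which real code would need to check.

```c
/* Ask for POSIX.1-2008 declarations before including any header. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create "path", creating at most max_parents missing ancestors first.
 * Returns 0 on success and -1 on failure. Unlike "mkdir -p", ancestors
 * above the allowed depth are never recreated.
 */
static int
mkdir_limited(const char *path, int max_parents)
{
    char   *parent;
    char   *slash;
    int     rc;

    assert(path != NULL);

    if (mkdir(path, 0700) == 0 || errno == EEXIST)
        return 0;
    if (errno != ENOENT || max_parents <= 0)
        return -1;

    /* Derive the parent directory by cutting at the last '/'. */
    parent = malloc(strlen(path) + 1);
    if (parent == NULL)
        return -1;
    strcpy(parent, path);
    slash = strrchr(parent, '/');
    if (slash == NULL || slash == parent)
    {
        free(parent);
        return -1;
    }
    *slash = '\0';

    rc = mkdir_limited(parent, max_parents - 1);
    free(parent);
    if (rc != 0)
        return -1;

    /* The parent exists now, so retry the directory itself. */
    if (mkdir(path, 0700) == 0 || errno == EEXIST)
        return 0;
    return -1;
}
```

With a limit of one or two levels, a tablespace root that a concurrent DROP TABLESPACE has removed stays removed, whereas an unbounded 'mkdir -p' style call would silently bring it back.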

Whether we ignore the mkdir failure or use mkdir_p, I found that these steps
seem to be error-prone as Postgres evolves, since they are hard to test and
it is not easy to think of all the potential bad cases. Is it possible that
we should do a real checkpoint (flush & update redo lsn) when seeing
checkpoint xlogs for these operations? This will slow down the standby, but
master also does this and these operations are not usual; in particular, it
seems that it does not usually slow down WAL receiving?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on master & standby
from checkpoint/restartpoint.
We do not worry about master, so we should not worry about standby either.

I didn't mention replication. I said that that slows recovery,
which is not governed by master's speed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17

pguo@pivotal.io

about 7 years ago

In reply to: Kyotaro Horiguchi (#16)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <
CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>
Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
I do not understand this. Can you elaborate?
Suppose we were recovering based on a backup at LSN1, targeting
LSN3, and then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called the "consistency point", before which the database is
not consistent. That is because, in the second trial, we are applying
WAL records older than those that were already applied. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint's redo LSN and LSN2=LSN3 is the crashed LSN.

Creation of an existing directory or dropping of a non-existent
directory is apparently inconsistent or "broken", so we should
stop recovery when seeing such WAL records while the database is in
a consistent state.

This seems to be hard to detect. I thought about using the invalid_page
mechanism long ago, but it seems to be hard to fully detect a dropped
tablespace.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

The original description is right in light of how the server
sees it. The record says exactly "copy dir 1663/1 to
65546/65547", and the latter path is resolved at the filesystem layer
via a symlink.

Whether viewed through $PG_DATA/pg_tblspc or through the symlinked real
tablespace directory, there is an additional directory like
PG_12_201905221 between the tablespace oid and the database oid. See the
directory layout below; this is why the directory info in the xlog dump
output was not correct.

$ ls -lh data/pg_tblspc/

total 0

lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2

$ ls -lh /tmp/2

total 0

drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221
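
To make the layout concrete, the path shapes seen in this thread ("base/1" and "pg_tblspc/16384/PG_12_201904281/16386") can be produced by a small formatter. This is only a sketch: format_db_path() is a hypothetical helper, the version directory is passed in as a plain string, and the real GetDatabasePath() handles more cases than the two shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* OID of the default tablespace, as in the "1663/1" of the log line above. */
#define DEFAULT_TABLESPACE_OID 1663

/*
 * Format the directory of database db_oid in tablespace spc_oid,
 * relative to the data directory. Returns what snprintf() returns.
 */
static int
format_db_path(char *buf, size_t buflen, unsigned spc_oid, unsigned db_oid,
               const char *version_dir)
{
    assert(buf != NULL && version_dir != NULL);

    if (spc_oid == DEFAULT_TABLESPACE_OID)
        return snprintf(buf, buflen, "base/%u", db_oid);

    /* Other tablespaces: the version directory sits between the two OIDs. */
    return snprintf(buf, buflen, "pg_tblspc/%u/%s/%u",
                    spc_oid, version_dir, db_oid);
}
```

For example, tablespace 16384 and database 16386 with version directory "PG_12_201904281" give the destination path shown in the corrected xlog dump output.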

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

We don't want to create the tablespace directory after a concurrent
DROP, as the comment just above says:

| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.

If the concurrent DROP TABLESPACE destroyed the grandparent
directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.

By the way, based on your previous test patch I added another test which
could easily detect
the missing src directory case.

#18

pguo@pivotal.io

almost 7 years ago

In reply to: Paul Guo (#17)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, May 27, 2019 at 9:39 PM Paul Guo <pguo@pivotal.io> wrote:

On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <
CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>
Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
I do not understand this. Can you elaborate?
Suppose we were recovering based on a backup at LSN1, targeting
LSN3, and then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called the "consistency point", before which the database is
not consistent. That is because, in the second trial, we are applying
WAL records older than those that were already applied. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint's redo LSN and LSN2=LSN3 is the crashed LSN.

Creation of an existing directory or dropping of a non-existent
directory is apparently inconsistent or "broken", so we should
stop recovery when seeing such WAL records while the database is in
a consistent state.
This seems to be hard to detect. I thought about using the invalid_page
mechanism long ago, but it seems to be hard to fully detect a dropped
tablespace.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.

The original description is right in light of how the server
sees it. The record says exactly "copy dir 1663/1 to
65546/65547", and the latter path is resolved at the filesystem layer
via a symlink.

Whether viewed through $PG_DATA/pg_tblspc or through the symlinked real
tablespace directory, there is an additional directory like
PG_12_201905221 between the tablespace oid and the database oid. See the
directory layout below; this is why the directory info in the xlog dump
output was not correct.

$ ls -lh data/pg_tblspc/

total 0

lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2

$ ls -lh /tmp/2

total 0

drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221

Also, I'd suggest using pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that includes a lot of get_parent_directory(),
MakePGDirectory(), etc., even though it is not fixing a bug, since the
pg_mkdir_p() change seems more graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'.

This change just makes the code more concise, though in theory the change
is not needed.

We don't want to create the tablespace directory after a concurrent
DROP, as the comment just above says:

| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.

If the concurrent DROP TABLESPACE destroyed the grandparent
directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.

By the way, based on your previous test patch I added another test which
could easily detect
the missing src directory case.

I updated the patch to v3. In this version, we skip the error if copydir
fails due to a missing src/dst directory, but to make sure that ignoring it
is legitimate, I added a simple log/forget mechanism (using a List) similar
to the xlog invalid page checking mechanism. Two TAP tests are included. One
is actually from a previous patch by Kyotaro in this email thread and the
other was added by me. In addition, dbase_desc() is fixed
to make the message accurate.
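
The log/forget idea described above can be sketched independently of the patch. The following is a rough illustration only: it uses a fixed-size array instead of a PostgreSQL List, and log_missing_dir(), forget_missing_dirs() and missing_dirs_all_resolved() are hypothetical names, not functions from the v3 patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_MISSING   64
#define MAX_PATH_LEN  256

static char missing_dirs[MAX_MISSING][MAX_PATH_LEN];
static int  n_missing = 0;

/* Remember a directory whose absence was tolerated during replay. */
static bool
log_missing_dir(const char *path)
{
    int i;

    assert(path != NULL);
    for (i = 0; i < n_missing; i++)
        if (strcmp(missing_dirs[i], path) == 0)
            return true;        /* already logged */
    if (n_missing >= MAX_MISSING || strlen(path) >= MAX_PATH_LEN)
        return false;           /* cannot record it */
    strcpy(missing_dirs[n_missing++], path);
    return true;
}

/* Forget every logged entry at or below "prefix", e.g. when a DROP is replayed. */
static void
forget_missing_dirs(const char *prefix)
{
    size_t len;
    int    kept = 0;
    int    i;

    assert(prefix != NULL);
    len = strlen(prefix);
    for (i = 0; i < n_missing; i++)
    {
        const char *p = missing_dirs[i];
        bool        under = strncmp(p, prefix, len) == 0 &&
                            (p[len] == '\0' || p[len] == '/');

        if (under)
            continue;           /* drop this entry */
        if (kept != i)
            strcpy(missing_dirs[kept], p);
        kept++;
    }
    n_missing = kept;
}

/* At the consistency point, anything still logged indicates a real problem. */
static bool
missing_dirs_all_resolved(void)
{
    return n_missing == 0;
}
```

In this sketch, a skipped copydir() would call log_missing_dir(), replay of a later drop database or drop tablespace record would call forget_missing_dirs(), and reaching consistency with entries still logged would be reported as an error.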

Thanks.

#19

thomas.munro@gmail.com

almost 7 years ago

In reply to: Paul Guo (#18)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <pguo@pivotal.io> wrote:

I updated the patch to v3. In this version, we skip the error if copydir
fails due to a missing src/dst directory, but to make sure that ignoring it
is legitimate, I added a simple log/forget mechanism (using a List) similar
to the xlog invalid page checking mechanism. Two TAP tests are included. One
is actually from a previous patch by Kyotaro in this email thread and the
other was added by me. In addition, dbase_desc() is fixed to make the
message accurate.

Hello Paul,

FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

--
Thomas Munro
https://enterprisedb.com

#20

pguo@pivotal.io

almost 7 years ago

In reply to: Thomas Munro (#19)

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

On Mon, Jul 8, 2019 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <pguo@pivotal.io> wrote:

I updated the patch to v3. In this version, we skip the error if copydir
fails due to a missing src/dst directory, but to make sure that ignoring it
is legitimate, I added a simple log/forget mechanism (using a List) similar
to the xlog invalid page checking mechanism. Two TAP tests are included. One
is actually from a previous patch by Kyotaro in this email thread and the
other was added by me. In addition, dbase_desc() is fixed to make the
message accurate.

Hello Paul,

FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

This failure is because the previous v3 patch does not align with a recent
patch

commit 660a2b19038b2f6b9f6bcb2c3297a47d5e3557a8
Author: Noah Misch <noah@leadboat.com>
Date: Fri Jun 21 20:34:23 2019 -0700

    Consolidate methods for translating a Perl path to a Windows path.

My patch uses TestLib::real_dir which is now replaced
with TestLib::perl2host in the above commit.

I've updated the patch to v4 to make my code align. Now the test passes in
my local environment.

Please see the attached v4 patch.

Thanks.


In reply to: Tom Lane (#127)

#129

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Thomas Munro (#128)

#130

thomas.munro@gmail.com

almost 4 years ago

In reply to: Tom Lane (#126)

#131

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Tom Lane (#126)

#132

thomas.munro@gmail.com

almost 4 years ago

In reply to: Thomas Munro (#130)

#133

thomas.munro@gmail.com

almost 4 years ago

In reply to: Tom Lane (#129)

#134