standby recovery fails (tablespace related) (tentative patch and discussion)
Hello postgres hackers,
Recently my colleagues and I encountered an issue: a standby can
not recover after an unclean shutdown and it's related to tablespace.
The issue is that the standby re-replay some xlog that needs tablespace
directories (e.g. create a database with tablespace),
but the tablespace directories has already been removed in the
previous replay.
In details, the standby normally finishes replaying for the below
operations, but due to unclean shutdown, the redo lsn
is not updated in pg_control and is still kept a value before the 'create
db with tabspace' xlog, however since the tablespace
directories were removed so it reports error when repay the database create
wal.
create db with tablespace
drop database
drop tablespace.
Here is the log on the standby.
2019-04-17 14:52:14.926 CST [23029] LOG: starting PostgreSQL 12devel on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat
4.8.5-4), 64-bit
2019-04-17 14:52:14.927 CST [23029] LOG: listening on IPv4 address
"192.168.35.130", port 5432
2019-04-17 14:52:14.929 CST [23029] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5432"
2019-04-17 14:52:14.943 CST [23030] LOG: database system was interrupted
while in recovery at log time 2019-04-17 14:48:27 CST
2019-04-17 14:52:14.943 CST [23030] HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2019-04-17 14:52:14.949 CST [23030] LOG: entering standby mode
2019-04-17 14:52:14.950 CST [23030] LOG: redo starts at 0/30105B8
2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
2019-04-17 14:52:14.951 CST [23029] LOG: startup process (PID 23030)
exited with exit code 1
2019-04-17 14:52:14.951 CST [23029] LOG: terminating any other active
server processes
2019-04-17 14:52:14.953 CST [23029] LOG: database system is shut down
Steps to reprodce:
1. setup a master and standby.
2. On both side, run: mkdir /tmp/some_isolation2_pg_basebackup_tablespace
3. Run SQLs:
drop tablespace if exists some_isolation2_pg_basebackup_tablespace;
create tablespace some_isolation2_pg_basebackup_tablespace location
'/tmp/some_isolation2_pg_basebackup_tablespace';
3. Clean shutdown and restart both postgres instances.
4. Run the following SQLs:
drop database if exists some_database_with_tablespace;
create database some_database_with_tablespace tablespace
some_isolation2_pg_basebackup_tablespace;
drop database some_database_with_tablespace;
drop tablespace some_isolation2_pg_basebackup_tablespace;
\! pkill -9 postgres; ssh host70 pkill -9 postgres
Note immediate shutdown via pg_ctl should also be able to reproduce and the
above steps probably does not 100% reproduce.
I created an initial patch for this issue (see the attachment). The idea is
re-creating those directories recursively. The above issue exists
in dbase_redo(),
but TablespaceCreateDbspace (for relation file create redo) is probably
buggy also so I modified that function also. Even there is no bug
in that function, it seems that using simple pg_mkdir_p() is cleaner. Note
reading TablespaceCreateDbspace(), I found it seems that this issue
has already be thought though insufficient but frankly this solution
(directory recreation) seems to be not perfect given actually this should
have been the responsibility of tablespace creation (also tablespace
creation does more like symlink creation, etc). Also, I'm not sure whether
we need to use invalid page mechanism (see xlogutils.c).
Another solution is that, actually, we create a checkpoint when
createdb/movedb/dropdb/droptablespace, maybe we should enforce to create
restartpoint on standby for such special kind of checkpoint wal - that
means we need to set a flag in checkpoing wal and let checkpoint redo
code to create restartpoint if that flag is set. This solution seems to be
safer.
Thanks,
Paul
Attachments:
0001-Recursively-create-tablespace-directories-if-those-a.patchapplication/octet-stream; name=0001-Recursively-create-tablespace-directories-if-those-a.patchDownload+19-28
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:
create db with tablespace
drop database
drop tablespace.
Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?
Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.
Asim
Please see my replies inline. Thanks.
On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:
create db with tablespace
drop database
drop tablespace.Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?
No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.
Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would
'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.
'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g. reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).
More feedback about all of the previous discussed solutions is welcome.
Hello.
At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>
Please see my replies inline. Thanks.
On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:
create db with tablespace
drop database
drop tablespace.Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.
That's right partly. As you must have seen, fast shutdown forces
restartpoint for the last checkpoint and it prevents the problem
from happening. Anyway it seems to be a problem.
Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.
But I don't think creating directory that is to be removed just
after is a wanted solution. The directory most likely to be be
removed just after.
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creatingdirectories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.
'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g. reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).More feedback about all of the previous discussed solutions is welcome.
It doesn't seem to me that the invalid page mechanism is
applicable in straightforward way, because it doesn't consider
simple file copy.
Drop failure is ignored any time. I suppose we can ignore the
error to continue recovering as far as recovery have not reached
consistency. The attached would work *at least* your case, but I
haven't checked this covers all places where need the same
treatment.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
ignore_dir_create_error_before_consistency.patchtext/x-patch; charset=us-asciiDownload+46-5
Oops! The comment in the previous patch is wrong.
At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>
At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>
Please see my replies inline. Thanks.
On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:
create db with tablespace
drop database
drop tablespace.Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.That's right partly. As you must have seen, fast shutdown forces
restartpoint for the last checkpoint and it prevents the problem
from happening. Anyway it seems to be a problem.Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would'create db with tablespace' redo log does not include the tablespace real
directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.But I don't think creating directory that is to be removed just
after is a wanted solution. The directory most likely to be be
removed just after.prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creatingdirectories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.
'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g. reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).More feedback about all of the previous discussed solutions is welcome.
It doesn't seem to me that the invalid page mechanism is
applicable in straightforward way, because it doesn't consider
simple file copy.Drop failure is ignored any time. I suppose we can ignore the
error to continue recovering as far as recovery have not reached
consistency. The attached would work *at least* your case, but I
haven't checked this covers all places where need the same
treatment.
The comment for the new function XLogMakePGDirectory is wrong:
+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.
The correct one is:
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
It is fixed in the attached.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
ignore_dir_create_error_before_consistency_v2.patchtext/x-patch; charset=us-asciiDownload+46-5
At Mon, 22 Apr 2019 16:40:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.164027.33866403.horiguchi.kyotaro@lab.ntt.co.jp>
At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>
At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>
Please see my replies inline. Thanks.
On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:
create db with tablespace
drop database
drop tablespace.Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand the related
code correctly.
For standby, checkpoint redo does not ensure that.
The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Allow-TAP-test-to-excecise-tablespace.patchtext/x-patch; charset=us-asciiDownload+37-7
0002-Add-check-for-recovery-failure-caused-by-tablespace.patchtext/x-patch; charset=us-asciiDownload+51-2
0003-Fix-failure-of-standby-startup-caused-by-tablespace-.patchtext/x-patch; charset=us-asciiDownload+42-6
On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:
The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.
+ # Check for symlink -- needed only on source dir
+ # (note: this will fall through quietly if file is already gone)
+ if (-l $srcpath)
+ {
+ croak "Cannot operate on symlink \"$srcpath\""
+ if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+ # We have mapped tablespaces. Copy them individually
+ my $linkname = $1;
+ my $tmpdir = TestLib::tempdir;
+ my $dstrealdir = TestLib::real_dir($tmpdir);
+ my $srcrealdir = readlink($srcpath);
+
+ opendir(my $dh, $srcrealdir);
+ while (readdir $dh)
+ {
+ next if (/^\.\.?$/);
+ my $spath = "$srcrealdir/$_";
+ my $dpath = "$dstrealdir/$_";
+
+ copypath($spath, $dpath);
+ }
+ closedir $dh;
+
+ symlink $dstrealdir, $destpath;
+ return 1;
+ }
The same stuff is proposed here:
/messages/by-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@mail.gmail.com
So there is a lot of demand for making the recursive copy more skilled
at handling symlinks for tablespace tests, and I'd like to propose to
do something among those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here? I am not sure
yet which one of the proposals is better than the other though.
--
Michael
At Tue, 23 Apr 2019 11:34:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20190423023438.GH2712@paquier.xyz>
On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:
The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.+ # Check for symlink -- needed only on source dir + # (note: this will fall through quietly if file is already gone) + if (-l $srcpath) + { + croak "Cannot operate on symlink \"$srcpath\"" + if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/); + + # We have mapped tablespaces. Copy them individually + my $linkname = $1; + my $tmpdir = TestLib::tempdir; + my $dstrealdir = TestLib::real_dir($tmpdir); + my $srcrealdir = readlink($srcpath); + + opendir(my $dh, $srcrealdir); + while (readdir $dh) + { + next if (/^\.\.?$/); + my $spath = "$srcrealdir/$_"; + my $dpath = "$dstrealdir/$_"; + + copypath($spath, $dpath); + } + closedir $dh; + + symlink $dstrealdir, $destpath; + return 1; + }The same stuff is proposed here:
/messages/by-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@mail.gmail.comSo there is a lot of demand for making the recursive copy more skilled
at handling symlinks for tablespace tests, and I'd like to propose to
do something among those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here? I am not sure
yet which one of the proposals is better than the other though.
TBH I like that (my one cieted above) not so much. However, I
prefer to have v12 if this is a bug and to be fixed in
v12. Otherwise we won't add a test for this later:p
Anyway I'll visit there. Thanks.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.
On Mon, Apr 22, 2019 at 3:40 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Show quoted text
Oops! The comment in the previous patch is wrong.
At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <
20190422.161513.258021727.horiguchi.kyotaro@lab.ntt.co.jp>At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <pguo@pivotal.io> wrote in
<CAEET0ZGpUrMGUzfyzVF9FuSq+zb=QovYa2cvyRnDOTvZ5vXxTw@mail.gmail.com>
Please see my replies inline. Thanks.
On Fri, Apr 19, 2019 at 12:38 PM Asim R P <apraveen@pivotal.io> wrote:
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <pguo@pivotal.io> wrote:
create db with tablespace
drop database
drop tablespace.Essentially, that sequence of operations causes crash recovery to
fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby. Is that right?No. In general, checkpoint is done for
drop_db/create_db/drop_tablespace on
master.
That makes the file/directory update-to-date if I understand therelated
code correctly.
For standby, checkpoint redo does not ensure that.That's right partly. As you must have seen, fast shutdown forces
restartpoint for the last checkpoint and it prevents the problem
from happening. Anyway it seems to be a problem.Your patch creates missing directories in the destination. Don't we
need to create the tablespace symlink under pg_tblspc/? I would'create db with tablespace' redo log does not include the tablespace
real
directory information.
Yes, we could add in it into the xlog, but that seems to be anoverdesign.
But I don't think creating directory that is to be removed just
after is a wanted solution. The directory most likely to be be
removed just after.prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list. It will allow us to avoid creatingdirectories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.
'invalid page' mechanism seems to be more proper for missing pages of a
file. For
missing directories, we could, of course, hack to use that (e.g.reading
any page of
a relfile in that database) to make sure the tablespace create code
(without symlink)
safer (It assumes those directories will be deleted soon).More feedback about all of the previous discussed solutions is welcome.
It doesn't seem to me that the invalid page mechanism is
applicable in straightforward way, because it doesn't consider
simple file copy.Drop failure is ignored any time. I suppose we can ignore the
error to continue recovering as far as recovery have not reached
consistency. The attached would work *at least* your case, but I
haven't checked this covers all places where need the same
treatment.The comment for the new function XLogMakePGDirectory is wrong:
+ * There is a possibility that WAL replay causes a creation of the same + * directory left by the previous crash. Issuing ERROR prevents the caller + * from continuing recovery.The correct one is:
+ * There is a possibility that WAL replay causes an error by creation of a + * directory under a directory removed before the previous crash. Issuing + * ERROR prevents the caller from continuing recovery.It is fixed in the attached.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.
If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.
If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.
XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)
So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:
me> but I haven't checked this covers all places where need the same
me> treatment.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Mmm. I posted to wrong thread. Sorry.
At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect some other kind of redo, which depends on
the files under the directory (they are not copied since the directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:me> but I haven't checked this covers all places where need the same
me> treatment.
RM_DBASE_ID is fixed by the patch.
XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.
HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.
SMGR:
- Both CREATE and TRUNCATE seems fine.
TBLSPC:
- We don't nest tablespace directories. No Problem.
I don't find a similar case.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v2-0004-Fix-failure-of-standby-startup-caused-by-tablespace-.patchtext/x-patch; charset=us-asciiDownload+48-12
On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Mmm. I posted to wrong thread. Sorry.
At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <
20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in
<CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this
database
create redo error, but I suspect some other kind of redo, which
depends on
the files under the directory (they are not copied since the directory
is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets uscope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:me> but I haven't checked this covers all places where need the same
me> treatment.RM_DBASE_ID is fixed by the patch.
XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.SMGR:
- Both CREATE and TRUNCATE seems fine.TBLSPC:
- We don't nest tablespace directories. No Problem.I don't find a similar case.
I took some time in digging into the related code. It seems that ignoring
if the dst directory cannot be created directly
should be fine since smgr redo code creates tablespace path finally by
calling TablespaceCreateDbspace().
What's more, I found some more issues.
1) The below error message is actually misleading.
2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
That should be due to dbase_desc(). It could be simply fixed following the
code logic in GetDatabasePath().
2) It seems that src directory could be missing then
dbase_redo()->copydir() could error out. For example,
\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;
Let's say, the standby finishes all replay but redo lsn on pg_control is
still the point at 'alter database', and then
kill postgres, then in theory when startup, dbase_redo()->copydir() will
ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. Below simple code change
could fix that.
diff --git a/src/backend/commands/dbcommands.c
b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
*/
FlushDatabaseBuffers(xlrec->src_db_id);
+ /*
+ * It is possible that the source directory is missing if
+ * we are re-replaying the xlog while subsequent xlogs
+ * drop the tablespace in previous replaying. For this
+ * we just skip.
+ */
+ if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ return;
+
/*
* Copy this subdirectory to the new location
*
If we want to fix the issue by ignoring the dst path create failure, I do
not think we should do
that in copydir() since copydir() seems to be a common function. I'm not
sure whether it is
used by some extensions or not. If no maybe we should move the dst patch
create logic
out of copydir().
Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace() to
replace
the code block includes a lot of get_parent_directory(), MakePGDirectory(),
etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more
graceful and simpler.
Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
be error-prone
along with postgre evolving since they are hard to test and also we are not
easy to think out
various potential bad cases. Is it possible that we should do real
checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will slow
down standby
but master also does this and also these operations are not usual,
espeically it seems that it
does not slow down wal receiving usually?
I updated the original patch to
1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.
2) Fixed dbase_desc(). Now the xlog output looks correct.
rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386
rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386
I'm not familiar with the TAP test details previously. I learned a lot
about how to test such case from Kyotaro's patch series.👍
On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:
Show quoted text
On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:Mmm. I posted to wrong thread. Sorry.
At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <
20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote
in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this
database
create redo error, but I suspect some other kind of redo, which
depends on
the files under the directory (they are not copied since the
directory is
not created) and also cannot be covered by the invalid page mechanism,
could fail. Thanks.If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist. This lets uscope
| * if the replay sequence contains writes to a relation that is later
| * deleted. (The original coding of this routine would insteadsuppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash. Better to write the data
| * until we are actually told to delete the file.)So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:me> but I haven't checked this covers all places where need the same
me> treatment.RM_DBASE_ID is fixed by the patch.
XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.SMGR:
- Both CREATE and TRUNCATE seems fine.TBLSPC:
- We don't nest tablespace directories. No Problem.I don't find a similar case.
I took some time in digging into the related code. It seems that ignoring
if the dst directory cannot be created directly
should be fine since smgr redo code creates tablespace path finally by
calling TablespaceCreateDbspace().
What's more, I found some more issues.1) The below error message is actually misleading.
2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547That should be due to dbase_desc(). It could be simply fixed following the
code logic in GetDatabasePath().2) It seems that src directory could be missing then
dbase_redo()->copydir() could error out. For example,\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;Let's say, the standby finishes all replay but redo lsn on pg_control is
still the point at 'alter database', and then
kill postgres, then in theory when startup, dbase_redo()->copydir() will
ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. Below simple code
change could fix that.diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 9707afabd9..7d755c759e 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record) */ FlushDatabaseBuffers(xlrec->src_db_id);+ /* + * It is possible that the source directory is missing if + * we are re-replaying the xlog while subsequent xlogs + * drop the tablespace in previous replaying. For this + * we just skip. + */ + if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode))) + return; + /* * Copy this subdirectory to the new location *If we want to fix the issue by ignoring the dst path create failure, I do
not think we should do
that in copydir() since copydir() seems to be a common function. I'm not
sure whether it is
used by some extensions or not. If no maybe we should move the dst patch
create logic
out of copydir().Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
to replace
the code block includes a lot of
get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more
graceful and simpler.Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
be error-prone
along with postgre evolving since they are hard to test and also we are
not easy to think out
various potential bad cases. Is it possible that we should do real
checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will
slow down standby
but master also does this and also these operations are not usual,
espeically it seems that it
does not slow down wal receiving usually?
Attachments:
v2-0001-skip-copydir-if-either-src-directory-or-dst-direc.patchapplication/octet-stream; name=v2-0001-skip-copydir-if-either-src-directory-or-dst-direc.patchDownload+44-34
Hi.
At Tue, 30 Apr 2019 14:33:47 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZGhmDKrq7JJu2rLLqcJBR8pA4OYrKsirZ5Ft8-deG1e8A@mail.gmail.com>
I updated the original patch to
It's reasonable not to touch copydir.
1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.
Nice catch.
+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ {
This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
+ ereport(WARNING,
+ (errmsg("directory \"%s\" for copydir() does not exists."
+ "It is possibly expected. Skip copydir().",
+ parent_path)));
This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.
| WARNING: skipped creating database directory: "%s"
| DETAIL: The tabelspace %u may have been removed just before crash.
# I'm not confident in this at all:(
2) Fixed dbase_desc(). Now the xlog output looks correct.
rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386
WAL records don't convey such information. The previous
description seems right to me.
I'm not familiar with the TAP test details previously. I learned a lot
about how to test such case from Kyotaro's patch series.👍
Yeah, good to hear.
On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:
If we want to fix the issue by ignoring the dst path create failure, I do
not think we should do
that in copydir() since copydir() seems to be a common function. I'm not
sure whether it is
used by some extensions or not. If no maybe we should move the dst patch
create logic
out of copydir().
Agreed to this.
Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
to replace
the code block includes a lot of
get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more
graceful and simpler.
But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.
Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
be error-prone
along with postgre evolving since they are hard to test and also we are
not easy to think out
various potential bad cases. Is it possible that we should do real
checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will
slow down standby
but master also does this and also these operations are not usual,
espeically it seems that it
does not slow down wal receiving usually?
That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thanks for the reply.
On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) + {This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.
I do not understand this. Can you elaborate?
+ ereport(WARNING, + (errmsg("directory \"%s\" for copydir() does not exists." + "It is possibly expected. Skip copydir().", + parent_path)));This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.| WARNING: skipped creating database directory: "%s"
| DETAIL: The tabelspace %u may have been removed just before crash.
Yeah. Looks better.
# I'm not confident in this at all:(
2) Fixed dbase_desc(). Now the xlog output looks correct.
rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386WAL records don't convey such information. The previous
description seems right to me.
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.
Also I'd suggest we should use pg_mkdir_p() in
TablespaceCreateDbspace()
to replace
the code block includes a lot of
get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more
graceful and simpler.But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
This change just makes the code concise. Though in theory the change is not
needed.
Whatever ignore mkdir failure or mkdir_p, I found that these steps
seem to
be error-prone
along with postgre evolving since they are hard to test and also we are
not easy to think out
various potential bad cases. Is it possible that we should do real
checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will
slow down standby
but master also does this and also these operations are not usual,
espeically it seems that it
does not slow down wal receiving usually?That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.
This behavior is rare and seems to have the same impact on master & standby
from checkpoint/restartpoint.
We do not worry about master so we should not worry about standby also.
Hello.
At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>
Thanks for the reply.
On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) + {This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.I do not understand this. Can you elaborate?
Suppose we were recoverying based on a backup at LSN1 targeting
to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called as "consistency point", before where the database is
not consistent. It's because we are applying WAL recored older
than those that were already applied in the second trial. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.
Creation of an existing directory or dropping of a non-existent
directory are apparently inconsistent or "broken" so we should
stop recovery when seeing such WAL records while database is in
consistent state.
+ ereport(WARNING, + (errmsg("directory \"%s\" for copydir() does not exists." + "It is possibly expected. Skip copydir().", + parent_path)));This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.| WARNING: skipped creating database directory: "%s"
| DETAIL: The tabelspace %u may have been removed just before crash.Yeah. Looks better.
# I'm not confident in this at all:(
2) Fixed dbase_desc(). Now the xlog output looks correct.
rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386WAL records don't convey such information. The previous
description seems right to me.2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.
The original description is right in the light of how the server
recognizes. The record exactly says that "copy dir 1663/1 to
65546/65547" and the latter path is converted in filesystem layer
via a symlink.
Also I'd suggest we should use pg_mkdir_p() in
TablespaceCreateDbspace()
to replace
the code block includes a lot of
get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more
graceful and simpler.But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
This change just makes the code concise. Though in theory the change is not
needed.
We don't want to create tablespace direcotory after concurrent
DROPing, as the comment just above is saying:
| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.
If the concurrent DROP TABLESPACE destroyed the grand parent
directory, we mustn't create it again.
Whatever ignore mkdir failure or mkdir_p, I found that these steps
seem to
be error-prone
along with postgre evolving since they are hard to test and also we are
not easy to think out
various potential bad cases. Is it possible that we should do real
checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will
slow down standby
but master also does this and also these operations are not usual,
espeically it seems that it
does not slow down wal receiving usually?That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.This behavior is rare and seems to have the same impact on master & standby
from checkpoint/restartpoint.
We do not worry about master so we should not worry about standby also.
I didn't mention replication. I said that that slows recovery,
which is not governed by master's speed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.
At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <
CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>Thanks for the reply.
On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) + {This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.I do not understand this. Can you elaborate?
Suppose we were recoverying based on a backup at LSN1 targeting
to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called as "consistency point", before where the database is
not consistent. It's because we are applying WAL recored older
than those that were already applied in the second trial. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.Creation of an existing directory or dropping of a non-existent
directory are apparently inconsistent or "broken" so we should
stop recovery when seeing such WAL records while database is in
consistent state.
This seems to be hard to detect. I thought using invalid_page mechanism
long ago,
but it seems to be hard to fully detect a dropped tablespace.
2) Fixed dbase_desc(). Now the xlog output looks correct.
rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386WAL records don't convey such information. The previous
description seems right to me.2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.The original description is right in the light of how the server
recognizes. The record exactly says that "copy dir 1663/1 to
65546/65547" and the latter path is converted in filesystem layer
via a symlink.
In either $PG_DATA/pg_tblspc or symlinked real tablespace directory,
there is an additional directory like PG_12_201905221 between
tablespace oid and database oid. See the directory layout as below,
so the directory info in xlog dump output was not correct.
$ ls -lh data/pg_tblspc/
total 0
lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2
$ ls -lh /tmp/2
total 0
drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221
Also I'd suggest we should use pg_mkdir_p() in
TablespaceCreateDbspace()
to replace
the code block includes a lot of
get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more
graceful and simpler.But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
This change just makes the code concise. Though in theory the change is
not
needed.
We don't want to create tablespace direcotory after concurrent
DROPing, as the comment just above is saying:| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.If the concurrent DROP TABLESPACE destroyed the grand parent
directory, we mustn't create it again.
Yes, this is a good reason to keep the original code. Thanks.
By the way, based on your previous test patch I added another test which
could easily detect
the missing src directory case.
On Mon, May 27, 2019 at 9:39 PM Paul Guo <pguo@pivotal.io> wrote:
On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hello.
At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <pguo@pivotal.io> wrote in <
CAEET0ZF9yN4DaXyuFLzOcAYyxuFF1Ms_OQWeA+Rwv3GhA5Q-SA@mail.gmail.com>Thanks for the reply.
On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) + {This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.I do not understand this. Can you elaborate?
Suppose we were recoverying based on a backup at LSN1 targeting
to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called as "consistency point", before where the database is
not consistent. It's because we are applying WAL recored older
than those that were already applied in the second trial. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.Creation of an existing directory or dropping of a non-existent
directory are apparently inconsistent or "broken" so we should
stop recovery when seeing such WAL records while database is in
consistent state.This seems to be hard to detect. I thought using invalid_page mechanism
long ago,
but it seems to be hard to fully detect a dropped tablespace.2) Fixed dbase_desc(). Now the xlog output looks correct.
rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386WAL records don't convey such information. The previous
description seems right to me.2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.The original description is right in the light of how the server
recognizes. The record exactly says that "copy dir 1663/1 to
65546/65547" and the latter path is converted in filesystem layer
via a symlink.In either $PG_DATA/pg_tblspc or symlinked real tablespace directory,
there is an additional directory like PG_12_201905221 between
tablespace oid and database oid. See the directory layout as below,
so the directory info in xlog dump output was not correct.$ ls -lh data/pg_tblspc/
total 0
lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2
$ ls -lh /tmp/2
total 0
drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221
Also I'd suggest we should use pg_mkdir_p() in
TablespaceCreateDbspace()
to replace
the code block includes a lot of
get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to bemore
graceful and simpler.
But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
This change just makes the code concise. Though in theory the change is
not
needed.
We don't want to create tablespace direcotory after concurrent
DROPing, as the comment just above is saying:| * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
| * or TablespaceCreateDbspace is running concurrently.If the concurrent DROP TABLESPACE destroyed the grand parent
directory, we mustn't create it again.Yes, this is a good reason to keep the original code. Thanks.
By the way, based on your previous test patch I added another test which
could easily detect
the missing src directory case.
I updated the patch to v3. In this version, we skip the error if copydir
fails due to missing src/dst directory,
but to make sure the ignoring is legal, I add a simple log/forget mechanism
(Using List) similar to the xlog invalid page
checking mechanism. Two tap tests are included. One is actually from a
previous patch by Kyotaro in this
email thread and another is added by me. In addition, dbase_desc() is fixed
to make the message accurate.
Thanks.
Attachments:
v3-0001-skip-copydir-if-either-src-directory-or-dst-direc.patchapplication/octet-stream; name=v3-0001-skip-copydir-if-either-src-directory-or-dst-direc.patchDownload+259-14
On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <pguo@pivotal.io> wrote:
I updated the patch to v3. In this version, we skip the error if copydir fails due to missing src/dst directory,
but to make sure the ignoring is legal, I add a simple log/forget mechanism (Using List) similar to the xlog invalid page
checking mechanism. Two tap tests are included. One is actually from a previous patch by Kyotaro in this
email thread and another is added by me. In addition, dbase_desc() is fixed to make the message accurate.
Hello Paul,
FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:
https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907
--
Thomas Munro
https://enterprisedb.com
On Mon, Jul 8, 2019 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <pguo@pivotal.io> wrote:
I updated the patch to v3. In this version, we skip the error if copydir
fails due to missing src/dst directory,
but to make sure the ignoring is legal, I add a simple log/forget
mechanism (Using List) similar to the xlog invalid page
checking mechanism. Two tap tests are included. One is actually from a
previous patch by Kyotaro in this
email thread and another is added by me. In addition, dbase_desc() is
fixed to make the message accurate.
Hello Paul,
FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:
This failure is because the previous v3 patch does not align with a recent
patch
commit 660a2b19038b2f6b9f6bcb2c3297a47d5e3557a8
Author: Noah Misch <noah@leadboat.com>
Date: Fri Jun 21 20:34:23 2019 -0700
Consolidate methods for translating a Perl path to a Windows path.
My patch uses TestLib::real_dir which is now replaced
with TestLib::perl2host in the above commit.
I've updated the patch to v4 to make my code align. Now the test passes in
my local environment.
Please see the attached v4 patch.
Thanks.