[bug fix] pg_rewind takes long time because it mistakenly copies data files

Started by Tsunakawa, Takayukialmost 8 years ago8 messages
#1Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
1 attachment(s)

Hello,

Our customer reported that pg_rewind took many hours to synchronize 400GB of data, even if the new primary doesn't perform any updates. The attached patch fixes that.

The cause was that pg_rewind failed to recognize data files in tablespace directories, resulting in the full copy of those files instead of WAL replay.

Regards
Takayuki Tsunakawa

Attachments:

pg_rewind_pathcmp.patchapplication/octet-stream; name=pg_rewind_pathcmp.patchDownload
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 19da1ee..727bccc 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -19,6 +19,7 @@
 #include "pg_rewind.h"
 
 #include "common/string.h"
+#include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "storage/fd.h"
 
@@ -554,7 +555,6 @@ print_filemap(void)
 static bool
 isRelDataFile(const char *path)
 {
-	char		buf[20 + 1];
 	RelFileNode rnode;
 	unsigned int segNo;
 	int			nmatch;
@@ -603,21 +603,19 @@ isRelDataFile(const char *path)
 		}
 		else
 		{
-			nmatch = sscanf(path, "pg_tblspc/%u/PG_%20s/%u/%u.%u",
-							&rnode.spcNode, buf, &rnode.dbNode, &rnode.relNode,
+			nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
+							&rnode.spcNode, &rnode.dbNode, &rnode.relNode,
 							&segNo);
-			if (nmatch == 4 || nmatch == 5)
+			if (nmatch == 3 || nmatch == 4)
 				matched = true;
 		}
 	}
 
 	/*
 	 * The sscanf tests above can match files that have extra characters at
-	 * the end, and the last check can also match a path belonging to a
-	 * different version (different TABLESPACE_VERSION_DIRECTORY). To make
-	 * eliminate such cases, cross-check that GetRelationPath creates the
-	 * exact same filename, when passed the RelFileNode information we
-	 * extracted from the filename.
+	 * the end. To eliminate such cases, cross-check that GetRelationPath
+	 * creates the exact same filename, when passed the RelFileNode information
+	 * we extracted from the filename.
 	 */
 	if (matched)
 	{
#2Michael Paquier
michael@paquier.xyz
In reply to: Tsunakawa, Takayuki (#1)
Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

On Mon, Feb 26, 2018 at 06:01:43AM +0000, Tsunakawa, Takayuki wrote:

The cause was that pg_rewind failed to recognize data files in
tablespace directories, resulting in the full copy of those files
instead of WAL replay.

Ouch. Confirmed.

If I test pg_rewind with a tablespace (primary and standby on same
server, base backup taken using --tablespace-mapping), then I bump
immediately into an assertion failure:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff99aa3f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff99aa3f06 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff8a30e4ec libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x00007fff9a8d46df libsystem_c.dylib`abort + 129
frame #3: 0x00007fff9a89bdd8 libsystem_c.dylib`__assert_rtn + 321
frame #4: 0x000000010f7680f6 pg_rewind`process_block_change(forknum=MAIN_FORKNUM, rnode=(spcNode = 16385, dbNode = 16384, relNode = 16386), blkno=4424) + 406 at filemap.c:374
frame #5: 0x000000010f760c9f pg_rewind`extractPageInfo(record=0x00007fc742800600) + 399 at parsexlog.c:401
frame #6: 0x000000010f76055e pg_rewind`extractPageMap(datadir="/Users/XXX/data/5432", startpoint=0, tliIndex=0, endpoint=134217768) + 270 at parsexlog.c:97
frame #7: 0x000000010f75f48c pg_rewind`main(argc=5, argv=0x00007fff504a1450) + 1964 at pg_rewind.c:300
frame #8: 0x00007fff88daf5ad libdyld.dylib`start + 1
(lldb) up 4
frame #4: 0x000000010f7680f6 pg_rewind`process_block_change(forknum=MAIN_FORKNUM, rnode=(spcNode = 16385, dbNode = 16384, relNode = 16386), blkno=4424) + 406 at filemap.c:374
371
372 if (entry)
373 {
-> 374 Assert(entry->isrelfile);
375
376 switch (entry->action)
377 {
(lldb) p *entry
(file_entry_t) $0 = {
path = 0x00007fc742424400 "pg_tblspc/16385/PG_11_201802061/16384/16386"
type = FILE_TYPE_REGULAR
action = FILE_ACTION_COPY
oldsize = 0
newsize = 36249600
isrelfile = '\0'
pagemap = (bitmap = <no value available>, bitmapsize = 0)
link_target = 0x00007fc742424430 <no value available>
next = 0x0000000000000000
}

Your patch is able to fix that. I have also checked that after
diverging the promoted server with more data and inserting data on the
old primary then the correct set of blocks from the tablespace is fetched as
well by pg_rewind. This patch also behaves correctly when creating a
new relation on the promoted server as it copies the whole relation. In
short your patch looks good to me.

Creating a test case for this patch would be nice, however this needs a
bit of work so as the tablespace map can be used with pg_basebackup and
PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
- PostgresNode::init_from_backup needs to be able to understand extra
options given by caller for pg_basebackup.
- RewindTest::create_standby should be extended with extra arguments
given for previous extension.
:(
--
Michael

#3Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Michael Paquier (#2)
RE: [bug fix] pg_rewind takes long time because it mistakenly copies data files

From: Michael Paquier [mailto:michael@paquier.xyz]

Your patch is able to fix that. I have also checked that after diverging
the promoted server with more data and inserting data on the old primary
then the correct set of blocks from the tablespace is fetched as well by
pg_rewind. This patch also behaves correctly when creating a new relation
on the promoted server as it copies the whole relation. In short your patch
looks good to me.

How quick, I was surprised. Thank you so much! I'd be glad if you could be the reviewer in CF:

https://commitfest.postgresql.org/17/1542/

Creating a test case for this patch would be nice, however this needs a
bit of work so as the tablespace map can be used with pg_basebackup and
PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
- PostgresNode::init_from_backup needs to be able to understand extra
options given by caller for pg_basebackup.
- RewindTest::create_standby should be extended with extra arguments given
for previous extension.
:(

That sounds difficult from your way of saying... but this may be a good opportunity to look into the TAP tests.

Regards
Takayuki Tsunakawa

#4Michael Paquier
michael@paquier.xyz
In reply to: Tsunakawa, Takayuki (#3)
Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

On Mon, Feb 26, 2018 at 08:13:02AM +0000, Tsunakawa, Takayuki wrote:

From: Michael Paquier [mailto:michael@paquier.xyz]

Your patch is able to fix that. I have also checked that after diverging
the promoted server with more data and inserting data on the old primary
then the correct set of blocks from the tablespace is fetched as well by
pg_rewind. This patch also behaves correctly when creating a new relation
on the promoted server as it copies the whole relation. In short your patch
looks good to me.

How quick, I was surprised. Thank you so much! I'd be glad if you could be the reviewer in CF:

https://commitfest.postgresql.org/17/1542/

Sure.

Creating a test case for this patch would be nice, however this needs a
bit of work so as the tablespace map can be used with pg_basebackup and
PostgresNode.pm (or use raw pg_basebackup commands in pg_rewind tests):
- PostgresNode::init_from_backup needs to be able to understand extra
options given by caller for pg_basebackup.
- RewindTest::create_standby should be extended with extra arguments given
for previous extension.
:(

That sounds difficult from your way of saying... but this may be a
good opportunity to look into the TAP tests.

Anything like that would be work only for HEAD I think as that's a bit
of refactoring. And indeed it could give you a good introduction to the
TAP facility.
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
1 attachment(s)
Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

On Mon, Feb 26, 2018 at 06:24:02PM +0900, Michael Paquier wrote:

Anything like that would be work only for HEAD I think as that's a bit
of refactoring. And indeed it could give you a good introduction to the
TAP facility.

So I would propose to just do that later. I have looked a second time
at your patch, attached is the set of tests I have run:
- With assertions enabled, I see pg_rewind failing on an assertion as
mentioned upthread.
- With assertion disabled (look at rewind_logs.log as generated by the
script attached), then all the files from the tablespace are copied with
all the file chunks fetched at the second phase of the rewind.
- With the patch, both the assertion and the handling of tablespace
files are fixed. Again, in the logs produced by the script you would
see that tablespace files are not completely copied anymore, and that
only portions of them are.

I have one small comment though. The comment block at the beginning of
isRelDataFile() refers to "pg_tblspc/<tblspc oid>/PG_9.4_201403261/".
Could you switch that to "pg_tblspc/<tblspc oid>/<tblspc version>"?
That's not directly the fault of your patch, but as long as we are on
it..

So I am marking this as ready for committer. Thanks for the report,
Tsunakawa-san!
--
Michael

Attachments:

rewind_tbspace.bashtext/plain; charset=us-asciiDownload
#6Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Michael Paquier (#5)
1 attachment(s)
RE: [bug fix] pg_rewind takes long time because it mistakenly copies data files

From: Michael Paquier [mailto:michael@paquier.xyz]

So I would propose to just do that later. I have looked a second time at
your patch, attached is the set of tests I have run:

Thanks so much, that has helped me a lot!

I have one small comment though. The comment block at the beginning of
isRelDataFile() refers to "pg_tblspc/<tblspc oid>/PG_9.4_201403261/".
Could you switch that to "pg_tblspc/<tblspc oid>/<tblspc version>"?
That's not directly the fault of your patch, but as long as we are on it..

Done, thanks again for marking the CF entry.

Regards
Takayuki Tsunakawa

Attachments:

pg_rewind_pathcmp_v2.patchapplication/octet-stream; name=pg_rewind_pathcmp_v2.patchDownload
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 19da1ee..6122f17 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -19,6 +19,7 @@
 #include "pg_rewind.h"
 
 #include "common/string.h"
+#include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "storage/fd.h"
 
@@ -554,7 +555,6 @@ print_filemap(void)
 static bool
 isRelDataFile(const char *path)
 {
-	char		buf[20 + 1];
 	RelFileNode rnode;
 	unsigned int segNo;
 	int			nmatch;
@@ -569,7 +569,7 @@ isRelDataFile(const char *path)
 	 * base/<db oid>/
 	 *		regular relations, default tablespace
 	 *
-	 * pg_tblspc/<tblspc oid>/PG_9.4_201403261/
+	 * pg_tblspc/<tblspc oid>/<tblspc version>/
 	 *		within a non-default tablespace (the name of the directory
 	 *		depends on version)
 	 *
@@ -603,21 +603,19 @@ isRelDataFile(const char *path)
 		}
 		else
 		{
-			nmatch = sscanf(path, "pg_tblspc/%u/PG_%20s/%u/%u.%u",
-							&rnode.spcNode, buf, &rnode.dbNode, &rnode.relNode,
+			nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
+							&rnode.spcNode, &rnode.dbNode, &rnode.relNode,
 							&segNo);
-			if (nmatch == 4 || nmatch == 5)
+			if (nmatch == 3 || nmatch == 4)
 				matched = true;
 		}
 	}
 
 	/*
 	 * The sscanf tests above can match files that have extra characters at
-	 * the end, and the last check can also match a path belonging to a
-	 * different version (different TABLESPACE_VERSION_DIRECTORY). To make
-	 * eliminate such cases, cross-check that GetRelationPath creates the
-	 * exact same filename, when passed the RelFileNode information we
-	 * extracted from the filename.
+	 * the end. To eliminate such cases, cross-check that GetRelationPath
+	 * creates the exact same filename, when passed the RelFileNode information
+	 * we extracted from the filename.
 	 */
 	if (matched)
 	{
#7Fujii Masao
masao.fujii@gmail.com
In reply to: Tsunakawa, Takayuki (#6)
Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

On Wed, Feb 28, 2018 at 3:58 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Michael Paquier [mailto:michael@paquier.xyz]

So I would propose to just do that later. I have looked a second time at
your patch, attached is the set of tests I have run:

Thanks so much, that has helped me a lot!

I have one small comment though. The comment block at the beginning of
isRelDataFile() refers to "pg_tblspc/<tblspc oid>/PG_9.4_201403261/".
Could you switch that to "pg_tblspc/<tblspc oid>/<tblspc version>"?
That's not directly the fault of your patch, but as long as we are on it..

Done, thanks again for marking the CF entry.

Thanks for the patch! Pushed.

Regards,

--
Fujii Masao

#8Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Fujii Masao (#7)
RE: [bug fix] pg_rewind takes long time because it mistakenly copies data files

From: Fujii Masao [mailto:masao.fujii@gmail.com]

Thanks for the patch! Pushed.

Thank you. I'm glad to see you again on this list.

Regards
Takayuki Tsunakawa