Recovery performance of DROP DATABASE with many tablespaces
Hi,
My colleague ran into a problem where WAL replay took a long time
on a standby with large shared_buffers when he dropped a database
using many tablespaces. As far as I can tell from the code, this
happens because DROP DATABASE generates as many XLOG_DBASE_DROP WAL
records as the number of tablespaces the database uses, and WAL replay
of each XLOG_DBASE_DROP record causes a full scan of shared_buffers.
That is, during recovery, DROP DATABASE scans shared_buffers once per
tablespace.
Since the first scan, caused by the first XLOG_DBASE_DROP record,
invalidates all pages in shared_buffers related to the database being
dropped, the subsequent scans triggered by the remaining records are
basically useless. So I'd like to change the code to avoid those
unnecessary scans and reduce the recovery time of DROP DATABASE.
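To illustrate the cost, here is a simplified sketch of the scan that
replaying one XLOG_DBASE_DROP record triggers. It is loosely modeled on
bufmgr.c's DropDatabaseBuffers(); the function name and details below
are simplified for illustration, not the actual source.

/*
 * Simplified sketch, loosely modeled on DropDatabaseBuffers().
 * Invalidating the dropped database's pages means inspecting every
 * buffer header, so the cost is O(NBuffers) per XLOG_DBASE_DROP
 * record, even when few buffers actually belong to that database.
 */
static void
drop_database_buffers_sketch(Oid dbid)
{
	int		i;

	for (i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);

		/* Every buffer is examined; only matching ones are invalidated. */
		if (bufHdr->tag.rnode.dbNode == dbid)
			InvalidateBuffer(bufHdr);	/* simplified; the real code locks the header */
	}
}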
Generally, the recovery performance of DROP DATABASE is not critical
for most users. But unfortunately my colleague's project may sometimes
need to drop a database using multiple tablespaces. So, if the fix is
not too complicated, I think it's worth applying.
The straightforward approach to avoiding those unnecessary scans is to
change DROP DATABASE so that it generates only a single XLOG_DBASE_DROP
record that registers the information for all the tablespaces. WAL
replay of that one record then scans shared_buffers once and deletes
all the tablespace directories. A POC patch is attached.
Thoughts?
Regards,
--
Fujii Masao
Attachment: xlog-dbase-drop_v1.patch (application/octet-stream)
*** a/src/backend/access/rmgrdesc/dbasedesc.c
--- b/src/backend/access/rmgrdesc/dbasedesc.c
***************
*** 35,43 **** dbase_desc(StringInfo buf, XLogReaderState *record)
else if (info == XLOG_DBASE_DROP)
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
! appendStringInfo(buf, "dir %u/%u",
! xlrec->db_id, xlrec->tablespace_id);
}
}
--- 35,46 ----
else if (info == XLOG_DBASE_DROP)
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
+ int i;
! appendStringInfo(buf, "dir");
! for (i = 0; i < xlrec->tablespace_num; i++)
! appendStringInfo(buf, " %u/%u",
! xlrec->db_id, xlrec->tablespace_ids[i]);
}
}
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
***************
*** 1362,1371 **** movedb(const char *dbname, const char *tblspcname)
xl_dbase_drop_rec xlrec;
xlrec.db_id = db_id;
! xlrec.tablespace_id = src_tblspcoid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
(void) XLogInsert(RM_DBASE_ID,
XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
--- 1362,1372 ----
xl_dbase_drop_rec xlrec;
xlrec.db_id = db_id;
! xlrec.tablespace_num = 1;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
+ XLogRegisterData((char *) &src_tblspcoid, sizeof(Oid));
(void) XLogInsert(RM_DBASE_ID,
XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
***************
*** 1873,1878 **** remove_dbtablespaces(Oid db_id)
--- 1874,1884 ----
Relation rel;
HeapScanDesc scan;
HeapTuple tuple;
+ List *ltblspc = NIL;
+ ListCell *cell;
+ int ntblspc;
+ int i;
+ Oid *tablespace_ids;
rel = heap_open(TableSpaceRelationId, AccessShareLock);
scan = heap_beginscan_catalog(rel, 0, NULL);
***************
*** 1900,1922 **** remove_dbtablespaces(Oid db_id)
(errmsg("some useless files may be left behind in old database directory \"%s\"",
dstpath)));
! /* Record the filesystem change in XLOG */
! {
! xl_dbase_drop_rec xlrec;
! xlrec.db_id = db_id;
! xlrec.tablespace_id = dsttablespace;
! XLogBeginInsert();
! XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
! (void) XLogInsert(RM_DBASE_ID,
! XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
! }
! pfree(dstpath);
}
heap_endscan(scan);
heap_close(rel, AccessShareLock);
}
--- 1906,1940 ----
(errmsg("some useless files may be left behind in old database directory \"%s\"",
dstpath)));
! ltblspc = lappend_oid(ltblspc, dsttablespace);
! pfree(dstpath);
! }
! ntblspc = list_length(ltblspc);
! Assert(ntblspc > 0);
! tablespace_ids = (Oid *) palloc(ntblspc * sizeof(Oid));
! i = 0;
! foreach(cell, ltblspc)
! tablespace_ids[i++] = lfirst_oid(cell);
! /* Record the filesystem change in XLOG */
! {
! xl_dbase_drop_rec xlrec;
! xlrec.db_id = db_id;
! xlrec.tablespace_num = ntblspc;
! XLogBeginInsert();
! XLogRegisterData((char *) &xlrec, MinSizeOfDbaseDropRec);
! XLogRegisterData((char *) tablespace_ids, ntblspc * sizeof(Oid));
!
! (void) XLogInsert(RM_DBASE_ID,
! XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
}
+ list_free(ltblspc);
+ pfree(tablespace_ids);
+
heap_endscan(scan);
heap_close(rel, AccessShareLock);
}
***************
*** 2122,2129 **** dbase_redo(XLogReaderState *record)
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
char *dst_path;
!
! dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
if (InHotStandby)
{
--- 2140,2146 ----
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
char *dst_path;
! int i;
if (InHotStandby)
{
***************
*** 2153,2163 **** dbase_redo(XLogReaderState *record)
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
! /* And remove the physical files */
! if (!rmtree(dst_path, true))
! ereport(WARNING,
! (errmsg("some useless files may be left behind in old database directory \"%s\"",
! dst_path)));
if (InHotStandby)
{
--- 2170,2186 ----
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
! for (i = 0; i < xlrec->tablespace_num; i++)
! {
! dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
!
! /* And remove the physical files */
! if (!rmtree(dst_path, true))
! ereport(WARNING,
! (errmsg("some useless files may be left behind in old database directory \"%s\"",
! dst_path)));
! pfree(dst_path);
! }
if (InHotStandby)
{
*** a/src/include/commands/dbcommands_xlog.h
--- b/src/include/commands/dbcommands_xlog.h
***************
*** 32,41 **** typedef struct xl_dbase_create_rec
typedef struct xl_dbase_drop_rec
{
- /* Records dropping of a single subdirectory incl. contents */
Oid db_id;
! Oid tablespace_id;
} xl_dbase_drop_rec;
extern void dbase_redo(XLogReaderState *rptr);
extern void dbase_desc(StringInfo buf, XLogReaderState *rptr);
--- 32,42 ----
typedef struct xl_dbase_drop_rec
{
Oid db_id;
! int tablespace_num; /* number of tablespace IDs */
! Oid tablespace_ids[FLEXIBLE_ARRAY_MEMBER];
} xl_dbase_drop_rec;
+ #define MinSizeOfDbaseDropRec offsetof(xl_dbase_drop_rec, tablespace_ids)
extern void dbase_redo(XLogReaderState *rptr);
extern void dbase_desc(StringInfo buf, XLogReaderState *rptr);
On Mon, Jun 4, 2018 at 9:46 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Generally, the recovery performance of DROP DATABASE is not critical
for most users. But unfortunately my colleague's project may sometimes
need to drop a database using multiple tablespaces. So, if the fix is
not too complicated, I think it's worth applying.
Agreed; in isolation the need for this improvement is not strongly
felt, but any improvement to the single serialized replay process is
definitely helpful.
The straightforward approach to avoiding those unnecessary scans is to
change DROP DATABASE so that it generates only a single XLOG_DBASE_DROP
record that registers the information for all the tablespaces. WAL
replay of that one record then scans shared_buffers once and deletes
all the tablespace directories. A POC patch is attached.
Also, irrespective of the performance improvement, it looks better to
just have a single xlog record for this.
Hi, Fujii-san
I came across this post and got interested in it, so I tried to
apply/test the patch, but I am not sure if I did it correctly.
I set up master-slave sync, 200GB shared_buffers, 20000
max_locks_per_transaction, and 1 DB with 500 table partitions spread
evenly across 5 tablespaces.
After dropping the DB, there was no difference in recovery performance
with or without the patch, so maybe I made a mistake somewhere. But
anyway, here are the results.
======WITHOUT PATCH=======
[200GB shared buffers]
DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
2018/07/04_13:35:00.161
dropdb
2018/07/04_13:35:05.591 5.591 sec
[200GB shared_buffers]
DROPDB (including DROP TABLE and DROP TABLESPACE)
real 3m19.717s
user 0m0.001s
sys 0m0.001s
======WITH PATCH=======
[200GB shared_buffers]
DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
2018/07/04_14:19:47.128
dropdb
2018/07/04_14:19:53.177 6.049 sec
[200GB shared_buffers]
DROPDB (included the DROP TABLE and DROP TABLESPACE commands)
real 3m51.834s
user 0m0.001s
sys 0m0.002s
Just in case, do you also have some performance test numbers/cases
to show the recovery perf improvement when dropping a database that
contains multiple tablespaces?
Regards,
Kirk Jamison
On Wed, Jul 4, 2018 at 4:47 PM, Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
I tried to apply/test the patch, but I am not sure if I did it
correctly. [...]
Just in case, do you also have some performance test numbers/cases
to show the recovery perf improvement when dropping a database that
contains multiple tablespaces?
Thanks for testing!
TBH, I have no numbers from such a test.
One question about your test: how did you measure the *recovery time*
of DROP DATABASE? Since it's *recovery* performance, it's basically
not easy to measure.
Regards,
--
Fujii Masao
On 4 June 2018 at 17:46, Fujii Masao <masao.fujii@gmail.com> wrote:
Hi,
My colleague ran into a problem where WAL replay took a long time
on a standby with large shared_buffers when he dropped a database
using many tablespaces. As far as I can tell from the code, this
happens because DROP DATABASE generates as many XLOG_DBASE_DROP WAL
records as the number of tablespaces the database uses, and WAL replay
of each XLOG_DBASE_DROP record causes a full scan of shared_buffers.
That is, during recovery, DROP DATABASE scans shared_buffers once per
tablespace.
Since the first scan, caused by the first XLOG_DBASE_DROP record,
invalidates all pages in shared_buffers related to the database being
dropped, the subsequent scans triggered by the remaining records are
basically useless. So I'd like to change the code to avoid those
unnecessary scans and reduce the recovery time of DROP DATABASE.
+1
Generally, the recovery performance of DROP DATABASE is not critical
for most users. But unfortunately my colleague's project may sometimes
need to drop a database using multiple tablespaces. So, if the fix is
not too complicated, I think it's worth applying.
Agreed
The straightforward approach to avoiding those unnecessary scans is to
change DROP DATABASE so that it generates only a single XLOG_DBASE_DROP
record that registers the information for all the tablespaces. WAL
replay of that one record then scans shared_buffers once and deletes
all the tablespace directories. A POC patch is attached.
Seems clear on a read of the patch, but I have not tested it.
Please replace tablespace_num with ntablespaces so it's clearer and
consistent with other WAL records.
Cheers
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 05, 2018 at 01:42:20AM +0900, Fujii Masao wrote:
TBH, I have no numbers from such a test.
One question about your test: how did you measure the *recovery time*
of DROP DATABASE? Since it's *recovery* performance, it's basically
not easy to measure.
It would be simple to measure the time it takes to replay this single
DROP DATABASE record by putting two gettimeofday() calls around it and
taking the time difference. There are many methods you could use here,
and I suppose that with a couple of GBs of shared_buffers you would see
a measurable difference with a dozen tablespaces or so. You could also
take a base backup after creating all the tablespaces, connect the
standby, and then drop the database on the primary to see the actual
time it takes. Your patch looks logically correct to me because
DropDatabaseBuffers is a *bottleneck* with large shared_buffers, and it
would be nice to see numbers.
--
Michael
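A minimal sketch of the timing approach suggested above; the wrapper
function and its placement are hypothetical, not part of the patch:

#include <sys/time.h>

/* Hypothetical wrapper: time the replay of one XLOG_DBASE_DROP record
 * with two gettimeofday() calls and log the difference.  Assumes the
 * usual PostgreSQL internal headers for XLogReaderState and elog(). */
static void
timed_dbase_redo(XLogReaderState *record)
{
	struct timeval start, end;
	long		elapsed_us;

	gettimeofday(&start, NULL);
	dbase_redo(record);			/* the replay under test */
	gettimeofday(&end, NULL);

	elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L +
				 (end.tv_usec - start.tv_usec);
	elog(LOG, "DROP DATABASE redo took %ld us", elapsed_us);
}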
On Tue, Jul 10, 2018 at 03:04:05PM +0900, Michael Paquier wrote:
It would be simple to measure the time it takes to replay this single
DROP DATABASE record by putting two gettimeofday() calls around it and
taking the time difference. [...] it would be nice to see numbers.
This was a couple of months ago, and nothing has happened since while
the patch has been waiting on its author, so the patch is marked as
returned with feedback.
--
Michael
On Thu, Jul 5, 2018 at 5:15 PM Simon Riggs <simon@2ndquadrant.com> wrote:
Seems clear on a read of the patch, but I have not tested it.
Please replace tablespace_num with ntablespaces so it's clearer and
consistent with other WAL records.
Thanks for the review! I changed the patch that way, and also rebased
it on master.
Attached is the latest version of the patch.
I will add this patch to the next CommitFest.
Regards,
--
Fujii Masao
Attachment: xlog_dbase_drop_v2.patch (application/octet-stream)
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index c7d60ce10d..d08c575872 100644
*** a/src/backend/access/rmgrdesc/dbasedesc.c
--- b/src/backend/access/rmgrdesc/dbasedesc.c
***************
*** 35,43 **** dbase_desc(StringInfo buf, XLogReaderState *record)
else if (info == XLOG_DBASE_DROP)
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
! appendStringInfo(buf, "dir %u/%u",
! xlrec->tablespace_id, xlrec->db_id);
}
}
--- 35,46 ----
else if (info == XLOG_DBASE_DROP)
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
+ int i;
! appendStringInfo(buf, "dir");
! for (i = 0; i < xlrec->ntablespaces; i++)
! appendStringInfo(buf, " %u/%u",
! xlrec->tablespace_ids[i], xlrec->db_id);
}
}
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 01d66212e9..643a539efe 100644
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
***************
*** 1404,1413 **** movedb(const char *dbname, const char *tblspcname)
xl_dbase_drop_rec xlrec;
xlrec.db_id = db_id;
! xlrec.tablespace_id = src_tblspcoid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
(void) XLogInsert(RM_DBASE_ID,
XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
--- 1404,1414 ----
xl_dbase_drop_rec xlrec;
xlrec.db_id = db_id;
! xlrec.ntablespaces = 1;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
+ XLogRegisterData((char *) &src_tblspcoid, sizeof(Oid));
(void) XLogInsert(RM_DBASE_ID,
XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
***************
*** 1915,1920 **** remove_dbtablespaces(Oid db_id)
--- 1916,1926 ----
Relation rel;
TableScanDesc scan;
HeapTuple tuple;
+ List *ltblspc = NIL;
+ ListCell *cell;
+ int ntblspc;
+ int i;
+ Oid *tablespace_ids;
rel = table_open(TableSpaceRelationId, AccessShareLock);
scan = table_beginscan_catalog(rel, 0, NULL);
***************
*** 1943,1965 **** remove_dbtablespaces(Oid db_id)
(errmsg("some useless files may be left behind in old database directory \"%s\"",
dstpath)));
! /* Record the filesystem change in XLOG */
! {
! xl_dbase_drop_rec xlrec;
! xlrec.db_id = db_id;
! xlrec.tablespace_id = dsttablespace;
! XLogBeginInsert();
! XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
! (void) XLogInsert(RM_DBASE_ID,
! XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
! }
! pfree(dstpath);
}
table_endscan(scan);
table_close(rel, AccessShareLock);
}
--- 1949,1989 ----
(errmsg("some useless files may be left behind in old database directory \"%s\"",
dstpath)));
! ltblspc = lappend_oid(ltblspc, dsttablespace);
! pfree(dstpath);
! }
! ntblspc = list_length(ltblspc);
! if (ntblspc == 0)
! {
! table_endscan(scan);
! table_close(rel, AccessShareLock);
! return;
! }
! tablespace_ids = (Oid *) palloc(ntblspc * sizeof(Oid));
! i = 0;
! foreach(cell, ltblspc)
! tablespace_ids[i++] = lfirst_oid(cell);
! /* Record the filesystem change in XLOG */
! {
! xl_dbase_drop_rec xlrec;
! xlrec.db_id = db_id;
! xlrec.ntablespaces = ntblspc;
!
! XLogBeginInsert();
! XLogRegisterData((char *) &xlrec, MinSizeOfDbaseDropRec);
! XLogRegisterData((char *) tablespace_ids, ntblspc * sizeof(Oid));
!
! (void) XLogInsert(RM_DBASE_ID,
! XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
}
+ list_free(ltblspc);
+ pfree(tablespace_ids);
+
table_endscan(scan);
table_close(rel, AccessShareLock);
}
***************
*** 2166,2173 **** dbase_redo(XLogReaderState *record)
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
char *dst_path;
!
! dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
if (InHotStandby)
{
--- 2190,2196 ----
{
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
char *dst_path;
! int i;
if (InHotStandby)
{
***************
*** 2197,2207 **** dbase_redo(XLogReaderState *record)
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
! /* And remove the physical files */
! if (!rmtree(dst_path, true))
! ereport(WARNING,
! (errmsg("some useless files may be left behind in old database directory \"%s\"",
! dst_path)));
if (InHotStandby)
{
--- 2220,2236 ----
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
! for (i = 0; i < xlrec->ntablespaces; i++)
! {
! dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
!
! /* And remove the physical files */
! if (!rmtree(dst_path, true))
! ereport(WARNING,
! (errmsg("some useless files may be left behind in old database directory \"%s\"",
! dst_path)));
! pfree(dst_path);
! }
if (InHotStandby)
{
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 46be8a615a..4e8a442c89 100644
*** a/src/include/commands/dbcommands_xlog.h
--- b/src/include/commands/dbcommands_xlog.h
***************
*** 32,41 **** typedef struct xl_dbase_create_rec
typedef struct xl_dbase_drop_rec
{
- /* Records dropping of a single subdirectory incl. contents */
Oid db_id;
! Oid tablespace_id;
} xl_dbase_drop_rec;
extern void dbase_redo(XLogReaderState *rptr);
extern void dbase_desc(StringInfo buf, XLogReaderState *rptr);
--- 32,42 ----
typedef struct xl_dbase_drop_rec
{
Oid db_id;
! int ntablespaces; /* number of tablespace IDs */
! Oid tablespace_ids[FLEXIBLE_ARRAY_MEMBER];
} xl_dbase_drop_rec;
+ #define MinSizeOfDbaseDropRec offsetof(xl_dbase_drop_rec, tablespace_ids)
extern void dbase_redo(XLogReaderState *rptr);
extern void dbase_desc(StringInfo buf, XLogReaderState *rptr);
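As an aside, the sizing arithmetic behind MinSizeOfDbaseDropRec: the
fixed part of the record runs up to the flexible array member, and the
Oid array payload follows it, which is why the patch registers the
record with two XLogRegisterData() calls. A hypothetical helper, purely
to spell out the arithmetic:

/* Hypothetical helper (not in the patch): total size of a
 * variable-length xl_dbase_drop_rec carrying ntablespaces OIDs. */
static Size
dbase_drop_rec_size(int ntablespaces)
{
	return MinSizeOfDbaseDropRec + ntablespaces * sizeof(Oid);
}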
On Tue, Jul 10, 2018 at 3:04 PM Michael Paquier <michael@paquier.xyz> wrote:
It would be simple to measure the time it takes to replay this single
DROP DATABASE record by putting two gettimeofday() calls around it and
taking the time difference. [...] Your patch looks logically correct to
me because DropDatabaseBuffers is a *bottleneck* with large
shared_buffers, and it would be nice to see numbers.
Thanks for the comment!
I measured how long it takes to replay DROP DATABASE with 1000
tablespaces, in the master and patched versions. shared_buffers was set
to 16GB.
[master]
It took 8 seconds to replay DROP DATABASE with 1000 tablespaces,
as follows. In this case, 16GB shared_buffers was fully scanned 1000 times.
2019-10-02 16:50:14 JST LOG: redo starts at 0/2000028
2019-10-02 16:50:22 JST LOG: redo done at 0/300A298
[patched]
It took less than 1 second to replay DROP DATABASE with 1000 tablespaces,
as follows. In this case, 16GB shared_buffers was scanned only once.
2019-10-02 16:47:03 JST LOG: redo starts at 0/2000028
2019-10-02 16:47:03 JST LOG: redo done at 0/3001588
Regards,
--
Fujii Masao
On Wed, Oct. 2, 2019 5:40 PM, Fujii Masao wrote:
I measured how long it takes to replay DROP DATABASE with 1000
tablespaces, in the master and patched versions. shared_buffers was set
to 16GB. [...]
Hi Fujii-san,
It's been a while, so I checked the patch once again.
It's fairly straightforward and I saw no problems nor bugs in the code.
[patched]
It took less than 1 second to replay DROP DATABASE with 1000 tablespaces,
The results are good.
I want to replicate the performance to confirm the results as well.
Could you share how you measured the recovery replay?
Did you actually execute a failover?
Regards,
Kirk Jamison
On Wed, Nov 13, 2019 at 3:57 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
It's been a while, so I checked the patch once again.
It's fairly straightforward and I saw no problems nor bugs in the code.
Thanks for the review!
The results are good.
I want to replicate the performance to confirm the results as well.
Could you share how you measured the recovery replay?
I forget the actual steps I used for the measurement, but I think they
were something like:
1. create database "hoge"
2. create 1,000 tablespaces
3. create 1,000 tables on the database "hoge",
   each table placed in a different tablespace
4. take a base backup
5. drop database "hoge"
6. shut down the server with immediate mode
7. start an archive recovery from the backup taken at #4
8. measure how long it takes to apply the DROP DATABASE record by
   checking the timestamps at REDO start and REDO end
I think that I performed the above steps on the master and
the patched version.
Did you actually execute a failover?
No.
Regards,
--
Fujii Masao
On Wed, Nov 13, 2019 5:34PM (GMT+9), Fujii Masao wrote:
On Wed, Nov 13, 2019 at 3:57 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
wrote:On Wed, Oct. 2, 2019 5:40 PM, Fujii Masao wrote:
On Tue, Jul 10, 2018 at 3:04 PM Michael Paquier <michael@paquier.xyz>
wrote:
On Thu, Jul 05, 2018 at 01:42:20AM +0900, Fujii Masao wrote:
TBH, I have no numbers measured by the test.
One question about your test is; how did you measure the
*recovery
time* of DROP DATABASE? Since it's *recovery* performance,
basically it's not easy to measure that.It would be simple to measure the time it takes to replay this
single DROP DATABASE record by putting two gettimeofday() calls or
such things and then take the time difference. There are many
methods that you could use here, and I suppose that with a shared
buffer setting of a couple of GBs of shared buffers you would see
a measurable difference with a dozen of tablespaces or so. You
could also take a base backup after creating all the tablespaces,
connect the standby and then drop the database on the primary to
see the actual time it takes. Your patch looks logically correct
to me because DropDatabaseBuffers is a
*bottleneck* with large shared_buffers, and it would be nice to
see numbers.Thanks for the comment!
I measured how long it takes to replay DROP DATABASE with 1000
tablespaces, in master and patched version. shared_buffers was set to16GB.
[master]
It took 8 seconds to replay DROP DATABASE with 1000 tablespaces, as follows.
In this case, 16GB shared_buffers was fully scanned 1000 times.2019-10-02 16:50:14 JST LOG: redo starts at 0/2000028
2019-10-02 16:50:22 JST LOG: redo done at 0/300A298[patched]
It took less than 1 second to replay DROP DATABASE with 1000
tablespaces, as follows. In this case, 16GB shared_buffers was scannedonly one time.
2019-10-02 16:47:03 JST LOG: redo starts at 0/2000028
2019-10-02 16:47:03 JST LOG: redo done at 0/3001588Hi Fujii-san,
It's been a while, so I checked the patch once again.
It's fairly straightforward and I saw no problems nor bug in the code.Thanks for the review!
[patched]
It took less than 1 second to replay DROP DATABASE with 1000
tablespaces,The results are good.
I want to replicate the performance to confirm the results as well.
Could you share how you measured the recovery replay?I forgot the actual steps that I used for the measurement.
But I think they are something like1. create database "hoge"
2. create 1,000 tablespaces
3. create 1,000 tables on the database "hoge".
each table should be placed in different tablespace.
4. take a base backup
5. drop database "hoge"
6. shutdown the server with immediate mode 7. start an archive recovery from
the backup taken at #4 8. measure how long it takes to apply DROP DATABASE
record by
checking the timestamp at REDO start and REDO end.I think that I performed the above steps on the master and the patched version.
Did you actually execute a failover?
No.
I'm sorry for the late reply, and thanks for the guide above.
I replicated the same recovery test above on a standalone server
and have confirmed with the logs that the patch made the recovery faster.
[MASTER/UNPATCHED] ~10 seconds
2019-11-19 15:25:23.891 JST [23042] LOG: redo starts at 0/180006A0
...
2019-11-19 15:25:34.492 JST [23042] LOG: redo done at 0/1800A478
[PATCHED] ~less than 1 sec
2019-11-19 15:31:59.415 JST [17625] LOG: redo starts at 0/40005B8
...
2019-11-19 15:32:00.159 JST [17625] CONTEXT: WAL redo at 0/4000668 for Database/DROP: dir 1663/16384 16385/16384 ... //further details omitted// ...
...
2019-11-19 15:32:00.159 JST [17625] LOG: redo done at 0/4001638
I believe there are no problems, so I am marking this patch now
as "Ready for Committer".
Regards,
Kirk Jamison
On Tue, Nov 19, 2019 at 3:39 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
I replicated the same recovery test above on a standalone server
and have confirmed with the logs that the patch made the recovery
faster. [...]
I believe there are no problems, so I am marking this patch now
as "Ready for Committer".
Thanks for the review! Committed.
Regards,
--
Fujii Masao